Bug 108606 - Raven Ridge: constant lockups since latest pull from Linus
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: XOrg git
Hardware: Other All
Importance: medium normal
Assignee: Default DRI bug account
Reported: 2018-10-30 21:41 UTC by Samantha McVey
Modified: 2019-11-19 09:01 UTC

Attachments
dmesg log (98.56 KB, text/plain)
2018-10-30 21:41 UTC, Samantha McVey
Xorg log (37.66 KB, text/plain)
2018-10-30 21:45 UTC, Samantha McVey
possible fix (1.47 KB, patch)
2018-11-01 02:44 UTC, Alex Deucher
amdgpu.ppfeaturemask=0xffff3fff resume from suspend lockup (74.70 KB, text/plain)
2018-11-01 08:06 UTC, Samantha McVey
amdgpu.ppfeaturemask=0xfffdbfff lockup during normal usage (99.89 KB, text/plain)
2018-11-01 17:22 UTC, Samantha McVey
dmesg 4.20-rc3 freeze on firmware 18.40 (321.62 KB, text/plain)
2018-11-20 14:48 UTC, Marvin Damschen

Description Samantha McVey 2018-10-30 21:41:46 UTC
Created attachment 142287
dmesg log

My system was stable on Linus's branch until he pulled in a bunch of DRM changes with merge commit 53b3b6bbfde6. Since that pull, the system freezes during ordinary desktop use.

Oct 30 15:31:11 kernel: [drm:gfx_v9_0_priv_reg_irq [amdgpu]] *ERROR* Illegal register access in command stream
Oct 30 15:31:11 kernel: [drm] GPU recovery disabled.
Oct 30 15:31:22 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* ring gfx timeout, signaled seq=61066, emitted seq=61068
Oct 30 15:31:22 kernel: [drm] GPU recovery disabled.
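
For anyone reading the timeout line above: a minimal sketch that pulls out the two sequence numbers. It assumes the gap between "emitted" and "signaled" is the number of submissions the ring had not completed when the timeout fired; the function name is hypothetical and only for illustration.

```python
import re

# Matches the amdgpu job-timeout message format quoted in the log above.
TIMEOUT_RE = re.compile(
    r"ring (?P<ring>\S+) timeout, "
    r"signaled seq=(?P<signaled>\d+), emitted seq=(?P<emitted>\d+)"
)

def pending_fences(line):
    """Return (ring, emitted - signaled) for a timeout line, or None."""
    m = TIMEOUT_RE.search(line)
    if m is None:
        return None
    return m.group("ring"), int(m.group("emitted")) - int(m.group("signaled"))

line = ("Oct 30 15:31:22 kernel: [drm:amdgpu_job_timedout [amdgpu]] *ERROR* "
        "ring gfx timeout, signaled seq=61066, emitted seq=61068")
print(pending_fences(line))
```

For the line in this report, that gives the gfx ring with two submissions outstanding.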


Full dmesg attached.
Comment 1 Samantha McVey 2018-10-30 21:45:30 UTC
Created attachment 142288
Xorg log

My system is a Lenovo A485 with the AMD PRO 2700U. It previously froze during certain games, but it was almost always stable when using KDE and browsing the web, so this seems to be a regression. Attaching the Xorg log to this post.
Comment 2 Alex Deucher 2018-10-30 21:50:45 UTC
Did you also update mesa or X?  Does reverting:
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=a06c3ee083b5c622bb9f4a687d7ab5265ee73dbf
help?
If so, can you try the latest firmware for your GPU?  If not, can you bisect?
Comment 3 Samantha McVey 2018-10-30 22:00:18 UTC
X and mesa were not upgraded. The problem is repeatable: if I boot into an older kernel things work, and if I boot into the latest Linus master my system freezes within 30 minutes on average. Firmware is the latest from linux-firmware.git (so it includes the recent Raven firmware bump).

I will revert that commit and see if I still get the GPU lockup.
Comment 4 Samantha McVey 2018-10-31 02:44:45 UTC
Alex,
I can report that reverting commit a06c3ee083b5c622bb9f4a687d7ab5265ee73dbf fixes the GPU freezes I was getting.
Comment 5 Alex Deucher 2018-10-31 19:07:02 UTC
Does the latest firmware from https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git
fix the issue as well?
Comment 6 Samantha McVey 2018-10-31 19:38:37 UTC
Yeah I have the latest linux-firmware installed. I also have raven_dmcu.bin which just was added so I'm certain I have the latest, since I installed the git version 3 days ago (when I read that Raven Ridge had its firmware updated).
Comment 7 Alex Deucher 2018-10-31 19:50:31 UTC
(In reply to Samantha McVey from comment #6)
> Yeah I have the latest linux-firmware installed. I also have raven_dmcu.bin
> which just was added so I'm certain I have the latest, since I installed the
> git version 3 days ago (when I read that Raven Ridge had its firmware
> updated).

So to confirm, the newest firmware did not help?  Please provide the output of /sys/kernel/debug/dri/0/amdgpu_firmware_info
Comment 8 Samantha McVey 2018-10-31 19:52:35 UTC
On the newest firmware, the issue is present, yes.

Output of /sys/kernel/debug/dri/0/amdgpu_firmware_info

VCE feature version: 0, firmware version: 0x00000000
UVD feature version: 0, firmware version: 0x00000000
MC feature version: 0, firmware version: 0x00000000
ME feature version: 42, firmware version: 0x0000009c
PFP feature version: 42, firmware version: 0x000000b1
CE feature version: 42, firmware version: 0x0000004d
RLC feature version: 1, firmware version: 0x00000213
RLC SRLC feature version: 1, firmware version: 0x00000001
RLC SRLG feature version: 1, firmware version: 0x00000001
RLC SRLS feature version: 1, firmware version: 0x00000001
MEC feature version: 42, firmware version: 0x00000192
MEC2 feature version: 42, firmware version: 0x00000192
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x0017ba78
SMC feature version: 0, firmware version: 0x00001e44
SDMA0 feature version: 41, firmware version: 0x000000a8
VCN feature version: 0, firmware version: 0x01004912
DMCU feature version: 0, firmware version: 0x00000001
VBIOS version: 113-RAVEN-110
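
To compare firmware dumps like the one above between machines, a small parser can help. This is a sketch that assumes every firmware line follows the "NAME feature version: N, firmware version: 0x..." layout shown here; the function name is hypothetical.

```python
import re

# One firmware line, e.g. "RLC feature version: 1, firmware version: 0x00000213".
FW_LINE_RE = re.compile(
    r"^(?P<name>.+?) feature version: (?P<feature>\d+), "
    r"firmware version: (?P<fw>0x[0-9a-fA-F]+)$"
)

def parse_firmware_info(text):
    """Map each firmware block name to (feature_version, firmware_version)."""
    info = {}
    for line in text.splitlines():
        m = FW_LINE_RE.match(line.strip())
        if m:  # lines such as "VBIOS version: ..." are skipped
            info[m.group("name")] = (int(m.group("feature")), int(m.group("fw"), 16))
    return info

sample = """\
RLC feature version: 1, firmware version: 0x00000213
RLC SRLC feature version: 1, firmware version: 0x00000001
MEC feature version: 42, firmware version: 0x00000192
VBIOS version: 113-RAVEN-110
"""
info = parse_firmware_info(sample)
print(info["RLC"])
```

On a live system the text would come from /sys/kernel/debug/dri/0/amdgpu_firmware_info, which normally needs root to read.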
Comment 9 Alex Deucher 2018-11-01 02:44:12 UTC
Created attachment 142315
possible fix

Does this patch fix the issue?
Comment 10 Alex Deucher 2018-11-01 02:53:50 UTC
Can you determine whether it's stutter or gfxoff that causes the problem?  Please try booting with amdgpu.ppfeaturemask=0xfffdbfff or amdgpu.ppfeaturemask=0xffff3fff on the kernel command line in grub and see which one fixes the issue.
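
To see how the two masks differ, the sketch below lists which bits each one clears relative to an all-ones 32-bit mask. The mapping of bit positions to features is an assumption here (bit 14 overdrive, bit 15 gfxoff, bit 17 stutter mode, as I recall them from the kernel's amd_shared.h) and should be checked against the tree being tested.

```python
def cleared_bits(mask, width=32):
    """Return the bit positions that are 0 in a width-bit feature mask."""
    return [bit for bit in range(width) if not (mask >> bit) & 1]

# Assumed meaning (verify in amd_shared.h): 14 = overdrive, 15 = gfxoff, 17 = stutter.
for mask in (0xfffdbfff, 0xffff3fff):
    print(hex(mask), cleared_bits(mask))
```

Both masks clear bit 14; they differ only in whether bit 17 or bit 15 is also cleared, which is what lets the two boots separate the two features.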
Comment 11 Samantha McVey 2018-11-01 04:27:43 UTC
I am testing with amdgpu.ppfeaturemask=0xfffdbfff right now. I still have that commit reverted, is this test still valid? What I mean is I have amdgpu.ppfeaturemask=0xfffdbfff in my cmdline at the same time as commit a06c3ee083b5c622bb9f4a687d7ab5265ee73dbf is reverted.
Comment 12 Alex Deucher 2018-11-01 05:06:28 UTC
(In reply to Samantha McVey from comment #11)
> I am testing with amdgpu.ppfeaturemask=0xfffdbfff right now. I still have
> that commit reverted, is this test still valid? What I mean is I have
> amdgpu.ppfeaturemask=0xfffdbfff in my cmdline at the same time as commit
> a06c3ee083b5c622bb9f4a687d7ab5265ee73dbf is reverted.

Yes, that is fine.  That commit just changes the default value of the parameter.  If you specify it directly, you override the default.
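
The override behaviour can be pictured with a small sketch: a value given on the command line wins, otherwise the driver default applies. The function name and the default values passed in are hypothetical, and treating the last occurrence as the winner is an assumption about how repeated parameters are handled.

```python
def effective_ppfeaturemask(cmdline, default):
    """Return the mask given on the command line, else the supplied default."""
    mask = default
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "amdgpu.ppfeaturemask" and sep:
            mask = int(value, 0)  # base 0 accepts both 0x... and decimal
    return mask

# The default below is a placeholder, not the driver's real default.
print(hex(effective_ppfeaturemask("quiet amdgpu.ppfeaturemask=0xfffdbfff", 0xffffffff)))
```

So whether or not the commit is reverted, a boot with the parameter set tests the same mask.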
Comment 13 Samantha McVey 2018-11-01 08:06:38 UTC
Created attachment 142319
amdgpu.ppfeaturemask=0xffff3fff resume from suspend lockup

Alex,
So far it seems that I am not getting the gpu lockup I had previously with either amdgpu.ppfeaturemask=0xfffdbfff or amdgpu.ppfeaturemask=0xffff3fff. I tested again and I still get the gpu lockup without these options (this is without having any commit reverted).

I will try your patch and test that now.

NOTE: While using amdgpu.ppfeaturemask=0xffff3fff I had a graphics lockup on resume from suspend, with messages I don't remember seeing before. May or may not be related to the current issue.

NOTE2: I didn't mention it before, but I have a systemd hook which disables C6 states prior to suspend and enables them again after suspend. The reason for this is it fixes GPU lockups I have gotten on resume. (May be relevant to this current bug or could be unrelated, I'm not sure how the state of the CPU affects the GPU…).
Comment 14 Samantha McVey 2018-11-01 08:21:14 UTC
Wanted to make sure a few things were clear from my last message:

"I tested again and I still get the gpu lockup without these options (this is without having any commit reverted)."
- Since I was not getting a lockup under either of the cmdline options, I went back and made sure I could still easily trigger the lockup with no command line options. I was able to trigger the lockup with no command line options.

The systemd hook:
- This hook is not related to any regression to my knowledge as I've used it since shortly after I got this laptop, so all of my testing has always had this suspend hook.
Comment 15 Samantha McVey 2018-11-01 16:53:01 UTC
Alex,
The conditions in your patch never get triggered. I added some print statements in there, and gfx_v9_0_init_rlc_ext_microcode(adev) runs, but `adev->powerplay.pp_feature &= ~PP_GFXOFF_MASK` never runs, because my RLC firmware version is 531 (0x213) but the RLC feature version is 1, so the condition
`((adev->gfx.rlc_fw_version == 531) && (adev->gfx.rlc_feature_version < 1)) {` is false.
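
As a quick check of the arithmetic, here is the same test evaluated with the RLC values from the firmware info in comment 8. This is only a Python mirror of the quoted condition, not the patch itself, and the function name is hypothetical.

```python
rlc_fw_version = 0x213     # "RLC ... firmware version: 0x00000213" in comment 8
rlc_feature_version = 1    # "RLC feature version: 1" in comment 8

def patch_condition(fw_version, feature_version):
    """Mirror of the check quoted above: fw == 531 and feature < 1."""
    return fw_version == 531 and feature_version < 1

print(rlc_fw_version, patch_condition(rlc_fw_version, rlc_feature_version))
```

The firmware version matches (0x213 is 531), but a feature version of 1 is not less than 1, so the gfxoff bit is never cleared on this machine.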
Comment 16 Samantha McVey 2018-11-01 17:22:44 UTC
Created attachment 142330
amdgpu.ppfeaturemask=0xfffdbfff lockup during normal usage

I did more testing with amdgpu.ppfeaturemask=0xfffdbfff, and I just got the GPU lockup while browsing the web. I am going to test the other command line option further to see whether it triggers any freezes during more extended usage. But it looks confirmed that the bug still triggers with 0xfffdbfff.
Comment 17 Marvin Damschen 2018-11-20 14:46:45 UTC
I am not sure whether this is the same issue, but I get freezes within ~1 hour of uptime with the latest firmware (18.40) on a Ryzen 2500U (Lenovo E485) with 4.20-rc3. I will attach a dmesg log of such a freeze.

I had not seen freezes for weeks, but they reappeared after upgrading the firmware from 18.30 to 18.40. Moving between different kernel/mesa/xorg driver versions did not help. I am now back on the 18.30 firmware and have not seen a freeze in a full work day.
Comment 18 Marvin Damschen 2018-11-20 14:48:26 UTC
Created attachment 142527
dmesg 4.20-rc3 freeze on firmware 18.40
Comment 19 Martin Peres 2019-11-19 09:01:21 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/579.

