Bug 111528 - Using Fan-Control causes mmhub-pagefault and unresponsive system on Navi
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: not set normal
Assignee: Default DRI bug account
Reported: 2019-09-01 08:08 UTC by Matthias Müller
Modified: 2019-11-19 09:50 UTC

Description Matthias Müller 2019-09-01 08:08:50 UTC
I first thought my issue was related to https://bugs.freedesktop.org/show_bug.cgi?id=111481 , but it seems it is a different one.

When using any kind of fan-control software (I tried CoreCtrl and radeon-profile), after a while I get a strange "stuttering", as if the whole OS halted for a few seconds and then continued for a few seconds. The halted periods grow while the usable periods shrink quickly, to the point of a seemingly unresponsive system.
It's not just the GUI that halts but the whole system: I had rsync running one time, and the HDD is audible enough to hear that it was only active during the seconds the GUI was responsive.

It doesn't happen at regular intervals (it seems to take anywhere between 30 and 120 minutes), and I haven't yet identified a direct cause, but in journalctl the same messages seem to appear every time it begins:

kernel: amdgpu: [powerplay] Failed to send message 0xf, response 0xfffffffb, param 0xfd6000
kernel: amdgpu: [powerplay] Failed to send message 0xf, response 0xfffffffb, param 0xfd6000
kernel: amdgpu 0000:0f:00.0: [mmhub] VMC page fault (src_id:0 ring:169 vmid:0 pasid:0)
kernel: amdgpu 0000:0f:00.0:   at page 0x0000600000fd6000 from 18
kernel: amdgpu 0000:0f:00.0: VM_L2_PROTECTION_FAULT_STATUS:0x00041152

After that, there are a lot of these:

kernel: amdgpu: [powerplay] Failed to send message 0x40, response 0xffffffc2 param 0x2
kernel: amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2, param 0x80

with some other amdgpu errors sprinkled in until shutdown or hard reset.
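
For anyone comparing their own journal against the messages above, here is a minimal Python sketch that pulls the numbers out of these lines. It is not part of any tool: the regular expressions are simply modelled on the quoted messages, and the helper names (`to_signed32`, `summarize`) are made up for illustration. It shows two things. Read as signed 32-bit integers, the response values are -5 and -62, which on Linux are the numeric values of EIO and ETIME; that the driver means them as errno codes is an assumption. The low 32 bits of the faulting page address also coincide with the param of the first failed message.

```python
import re

# Patterns modelled on the kernel messages quoted in this report.
# The comma after the response value is optional because one quoted line omits it.
SMU_FAIL = re.compile(
    r"\[powerplay\] Failed to send message (0x[0-9a-f]+), "
    r"response (0x[0-9a-f]+),? param (0x[0-9a-f]+)"
)
PAGE_FAULT = re.compile(r"at page (0x[0-9a-f]+) from (\d+)")


def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed integer."""
    return value - (1 << 32) if value & 0x80000000 else value


def summarize(lines):
    """Collect failed SMU messages and page-fault addresses from log lines."""
    failures, faults = [], []
    for line in lines:
        m = SMU_FAIL.search(line)
        if m:
            msg, resp, param = (int(g, 16) for g in m.groups())
            failures.append((msg, to_signed32(resp), param))
            continue
        m = PAGE_FAULT.search(line)
        if m:
            faults.append(int(m.group(1), 16))
    return failures, faults


# Lines copied from the report above.
SAMPLE = [
    "kernel: amdgpu: [powerplay] Failed to send message 0xf, response 0xfffffffb, param 0xfd6000",
    "kernel: amdgpu 0000:0f:00.0: [mmhub] VMC page fault (src_id:0 ring:169 vmid:0 pasid:0)",
    "kernel: amdgpu 0000:0f:00.0:   at page 0x0000600000fd6000 from 18",
    "kernel: amdgpu: [powerplay] Failed to send message 0x40, response 0xffffffc2 param 0x2",
    "kernel: amdgpu: [powerplay] Failed to send message 0xe, response 0xffffffc2, param 0x80",
]

failures, faults = summarize(SAMPLE)
for msg, resp, param in failures:
    print(f"message {msg:#x}: response {resp}, param {param:#x}")
for addr in faults:
    related = any(param == addr & 0xFFFFFFFF for _, _, param in failures)
    print(f"page fault at {addr:#x}, low 32 bits match a failed param: {related}")
```

To check a real log, the same function could be fed the lines of `journalctl -k` output instead of `SAMPLE`.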

It doesn't occur without fan-control software running, so I'm pretty certain it is somehow related to that.
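
To correlate the onset with what the fan-control tool is doing, it may help to record the fan state next to the kernel log. The following is a read-only Python sketch, assuming the tools drive the fan through amdgpu's hwmon sysfs files (`pwm1_enable`, `pwm1`, `fan1_input`) and that the card is `card0`. Both are assumptions about the setup, and `read_fan_state` is a hypothetical helper name. It only reads the files and never writes to them.

```python
import glob
import os


def read_fan_state(device_dir="/sys/class/drm/card0/device"):
    """Return pwm1_enable, pwm1 and fan1_input for each hwmon node found.

    The default path assumes the GPU is card0; adjust it for other setups.
    """
    states = []
    for hwmon in sorted(glob.glob(os.path.join(device_dir, "hwmon", "hwmon*"))):
        state = {"path": hwmon}
        for name in ("pwm1_enable", "pwm1", "fan1_input"):
            try:
                with open(os.path.join(hwmon, name)) as f:
                    state[name] = int(f.read().strip())
            except (OSError, ValueError):
                state[name] = None  # attribute missing or unreadable
        states.append(state)
    return states


if __name__ == "__main__":
    # Prints nothing if no hwmon node exists under the assumed path.
    for state in read_fan_state():
        print(state)
```

In the amdgpu hwmon interface, `pwm1_enable` is 1 for manual fan control and 2 for automatic control, so logging it periodically would show whether the stutter starts while the fan is under manual control.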

System: 
PowerColor RX 5700 XT Red Devil
Ryzen 7 3800X on X570 Taichi
Manjaro KDE
Manjaro 5.3rc6.d0826.ga55aa89-1
mesa-git 1:19.3.0_devel.114849.0142dcb990e-1
llvm-libs-git 10.0.0_r325376.70e158e09e9-1
And if it matters: firmware from https://aur.archlinux.org/packages/linux-firmware-agd5f-radeon-navi10/ v2019.08.26.14.36-1
Comment 1 Marko Popovic 2019-09-02 09:55:08 UTC
I can confirm that using CoreCtrl made my GPU quite unstable: sometimes it would just lag for a few seconds, and other times it would completely crash the system until I rebooted.

The last time I observed this bug was when using Manjaro-gnome, Kernel 5.3 RC4 and MESA 19.3-git / LLVM10-git.

I haven't used CoreCtrl since because of this issue.
Comment 2 Martin Peres 2019-11-19 09:50:57 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/899.