Summary: | amdgpu coolers never stoping linux | ||
---|---|---|---|
Product: | DRI | Reporter: | Denis Denisov <denji0k> |
Component: | DRM/AMDgpu | Assignee: | Default DRI bug account <dri-devel> |
Status: | RESOLVED MOVED | QA Contact: | |
Severity: | major | ||
Priority: | high | CC: | bugs.freedesktop.org, fdsfgs, lriutzel |
Version: | unspecified | ||
Hardware: | All | ||
OS: | Linux (All) | ||
Whiteboard: | |||
Description
Denis Denisov
2017-04-12 18:21:00 UTC
You should try kernel 4.12. There is some progress with pwm1_enable: now we have not only "1" which is manual control but "0" (full speed) and "2" (this should be FW control, although it doesn’t look so for me: fans are still spinning and pwm1 reads 0 or 122-124 randomly). Still have to use userspace control (mode "1").

(In reply to Sergey Kochneff from comment #1)
> You should try kernel 4.12. There is some progress with pwm1_enable: now we
> have not only "1" which is manual control but "0" (full speed) and "2" (this
> should be FW control, although it doesn’t look so for me: fans are still
> spinning and pwm1 reads 0 or 122-124 randomly). Still have to use userspace
> control (mode "1").

I use 4.13.0-041300rc3-generic, but still do not see any difference between "pwm1_enable=1" and "pwm1_enable=2". 4.13.0-041300rc3-generic uses "pwm1_enable=2" by default. I still need to run a daemon that monitors temperature and voltage load and adjusts the fan speed accordingly.

$ lsb_release -drc
Description: Ubuntu 17.04
Release: 17.04
Codename: zesty
$ uname -rso
Linux 4.13.0-041300rc3-generic GNU/Linux

I believe this is related. Might be a separate issue though. I have an Asus RX550. With the amdgpu driver my fan is at what I believe is 100% all the time, though reporting doesn't work. I can change the pwm1_enable setting between 1 and 2, but it makes no difference to the fan.

$ cat /sys/class/drm/card0/device/hwmon/hwmon0/pwm1
cat: pwm1: No such device

Attempting to set pwm1 results in no change.
$ sensors
amdgpu-pci-2400
Adapter: PCI adapter
fan1: N/A
temp1: +45.0°C (crit = +0.0°C, hyst = +0.0°C)
$ lsb_release -drc
Description: Arch Linux
Release: rolling
Codename: n/a
$ uname -rso
Linux 4.12.10-1-ARCH GNU/Linux

$ ls /sys/class/drm/card*/device/hwmon/hwmon*/pwm*

$ ls /sys/class/drm/card*/device/hwmon/hwmon*/pwm*
/sys/class/drm/card0/device/hwmon/hwmon0/pwm1
/sys/class/drm/card0/device/hwmon/hwmon0/pwm1_enable
/sys/class/drm/card0/device/hwmon/hwmon0/pwm1_max
/sys/class/drm/card0/device/hwmon/hwmon0/pwm1_min

Lucas, the Asus RX-550 (and apparently cards from all manufacturers for this chipset) doesn't have any fan control. See my comment here: https://bugs.freedesktop.org/show_bug.cgi?id=97556#c7

Dimitrios, good to know. Looks like my short term solution became a long term one. I unplugged the builtin fan, ziptied on another fan and connected it to a motherboard fan header. I might come back and extend the original fan header to reach the motherboard.

Yeah I documented the workaround too: https://forum-en.msi.com/index.php?topic=298468.0
The root cause may be related to how amdgpu handles not being able to read its powerplay settings, because motherboard BIOSes (old AMI) don't set up MMIO BARs properly and Intel submitted a patch to enforce restrictions on memory address regions in UEFI.

(In reply to Luke McKee from comment #8)
> Yeah I documented the workaround too:
> https://forum-en.msi.com/index.php?topic=298468.0
>

Please stop posting this on every bug report.

(In reply to Alex Deucher from comment #9)
> (In reply to Luke McKee from comment #8)
> > Yeah I documented the workaround too:
> > https://forum-en.msi.com/index.php?topic=298468.0
> >
>
> Please stop posting this on every bug report.

That page is confusing and not likely related to any of these.

In this case it was on topic. The link explains how to use the fancontrol script from lm_sensors to work around fan control issues.
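For readers following the userspace-control workaround mentioned in this thread (pwm1_enable mode "1", then writing pwm1), here is a minimal sketch of a single control step. It is not the lm_sensors fancontrol script and not anything the driver ships. The function names `temp_to_pwm` and `set_fan_once`, the 40 °C / 80 °C thresholds and the linear ramp are all illustrative assumptions. It relies only on the standard hwmon files (`temp1_input` in millidegrees Celsius, `pwm1`, `pwm1_enable`, `pwm1_min`, `pwm1_max`) and will not help on cards where pwm1 is not backed by real fan control, as reported above for the RX 550.

```shell
#!/bin/sh
# Hypothetical helper: map a temperature (millidegrees C) to a PWM value.
# The 40 C / 80 C thresholds are illustrative assumptions, not driver defaults.
temp_to_pwm() {
    temp_mc=$1; pwm_min=$2; pwm_max=$3
    low=40000; high=80000
    if [ "$temp_mc" -le "$low" ]; then
        echo "$pwm_min"
    elif [ "$temp_mc" -ge "$high" ]; then
        echo "$pwm_max"
    else
        # Linear ramp between the two thresholds (integer arithmetic).
        echo $(( pwm_min + (temp_mc - low) * (pwm_max - pwm_min) / (high - low) ))
    fi
}

# Hypothetical helper: perform one control step on a hwmon directory.
set_fan_once() {
    hwmon=$1
    [ -w "$hwmon/pwm1_enable" ] || {
        echo "no writable pwm1_enable in $hwmon" >&2
        return 1
    }
    # "1" selects manual (userspace) control, as described in this thread.
    echo 1 > "$hwmon/pwm1_enable" || return 1
    temp_mc=$(cat "$hwmon/temp1_input") || return 1
    # Fall back to the usual 0..255 range if the min/max files are absent.
    pwm_min=$(cat "$hwmon/pwm1_min" 2>/dev/null || echo 0)
    pwm_max=$(cat "$hwmon/pwm1_max" 2>/dev/null || echo 255)
    temp_to_pwm "$temp_mc" "$pwm_min" "$pwm_max" > "$hwmon/pwm1"
}
```

As root, this could be run periodically, for example `while set_fan_once /sys/class/drm/card0/device/hwmon/hwmon0; do sleep 5; done`, using the hwmon path shown in the listings above. Writing "2" back to pwm1_enable should hand control back to the firmware mode discussed in comment #1.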
I saw on another ticket when I first posted here that dc=1 fixed the fancontrol issues. Finally I got dc=1 working and still it doesn't resolve the dpm fancontrol issues on my platform.

https://github.com/kobalicek/amdtweak
as root
# ./amdtweak --card 0 --verbose --extract-bios /tmp/amdbios.bin
fails. The sysfs shows that the powerplay tables are not proper too.

[ 4969.713277] resource sanity check: requesting [mem 0x000c0000-0x000dffff], which spans more than PCI Bus 0000:00 [mem 0x000c0000-0x000c3fff window]
[ 4969.713283] caller pci_map_rom+0x66/0xf0 mapping multiple BARs
[ 4969.713289] amdgpu 0000:01:00.0: Invalid PCI ROM header signature: expecting 0xaa55, got 0xffff

If it can't read its powerplay table because it can't read the bios, maybe that's why there are all these problems.

(In reply to Alex Deucher from comment #9)
> > Please stop posting this on every bug report.

https://bugs.freedesktop.org/show_bug.cgi?id=100666#c0
Also, the users above on this ticket, when they grepped their dmesg, wouldn't have seen any powerplay messages because they grepped for radeon instead of amdgpu.

[ 10.124232] amdgpu: [powerplay] failed to send message 309 ret is 254
[ 10.124248] amdgpu: [powerplay] failed to send pre message 14e ret is 254

Maybe Denis could confirm or deny if this is in his dmesg?

(In reply to Alex Deucher from comment #10)
> > Please stop posting this on every bug report. That page is confusing and not likely related to any of these.

You obviously know about this sir.
https://bugs.freedesktop.org/attachment.cgi?id=135739
https://bugs.freedesktop.org/show_bug.cgi?id=98798
A new Intel patch has caused a reversion to the behaviour in this old ticket. It's using pci_info not dev_info now.

(In reply to Luke McKee from comment #11)
> In this case it was on topic. The link explains how to use fancontrol script
> from lm_sensors to work around fan control issues. I saw on another ticket
> when I first posted here that dc=1 fixed the fancontrol issues.
> Finally I got dc=1 working and still it doesn't resolve the dpm fancontrol
> issues on my platform.

dc and powerplay are largely independent. It's generally not likely that one will affect the other.

> https://github.com/kobalicek/amdtweak
> as root
> # ./amdtweak --card 0 --verbose --extract-bios /tmp/amdbios.bin
> fails. The sysfs shows that the powerplay tables are not proper too.

I'm not familiar with that tool or how it goes about attempting to fetch the vbios. The driver uses several mechanisms to fetch it depending on the platform. It's possible that tool does something weird to fetch the vbios and it's possible that tool incorrectly interprets some of the vbios tables.

> [ 4969.713277] resource sanity check: requesting [mem
> 0x000c0000-0x000dffff], which spans more than PCI Bus 0000:00 [mem
> 0x000c0000-0x000c3fff window]
> [ 4969.713283] caller pci_map_rom+0x66/0xf0 mapping multiple BARs
> [ 4969.713289] amdgpu 0000:01:00.0: Invalid PCI ROM header signature:
> expecting 0xaa55, got 0xffff

This last message is from the pci subsystem and is harmless. If the driver were not able to load the vbios, it would fail to load.

> If it can't read its powerplay table because it can't read the bios, maybe
> that's why there are all these problems.

The driver is able to load the vbios image just fine. If it wasn't able to, or if there was a major problem with one of the tables, the driver would fail to load.

> (In reply to Alex Deucher from comment #9)
> > > Please stop posting this on every bug report.
>
> https://bugs.freedesktop.org/show_bug.cgi?id=100666#c0
> Also, the users above on this ticket, when they grepped their dmesg,
> wouldn't have seen any powerplay messages because they grepped for radeon
> instead of amdgpu.
>
> [ 10.124232] amdgpu: [powerplay]
> failed to send message 309 ret is 254
> [ 10.124248] amdgpu: [powerplay]
> failed to send pre message 14e ret is 254

There are lots of reasons an smu message might fail.
Just because you see an smu message failure does not mean you are seeing the same issue as someone else. It's like a GPU hang. There are lots of potential root causes.

Alex, thanks for your help.

How it gets the rom is probably the same as this shell script using the pci method listed on that github link in the last comment.

# To read ROM you first need to write `1` to it, then read it, and then write
# `0` to it as described in the documentation. The reason is that the content
# is not provided by default, by writing `1` to it you are telling the driver
# to make it accessible.
CARD_ID=0
CARD_ROM="/sys/class/drm/card${CARD_ID}/device/rom"
FILE_ROM="amdgpu-rom.bin"

echo 1 > $CARD_ROM
cat $CARD_ROM > $FILE_ROM
echo 0 > $CARD_ROM
echo "Saved as ${FILE_ROM}"

--
output:
cat: /sys/class/drm/card0/device/rom: Input/output error
[Not] Saved as amdgpu-rom.bin

Are there any other user-space accessible methods to extract / write the rom in Linux? For now I'm only focusing on comparing the pp table to other roms, not modifying it.

Maybe the powerplay problem is an atom-bios issue. If it's still broken in this 4.16-rc1 version I'm trying out now, I'll open a ticket.

For your reference, this is the ticket that claims powerplay dpm is fixed in newer kernels / dc=1 in 4.15.
https://bugs.freedesktop.org/show_bug.cgi?id=100443#c37

(In reply to Luke McKee from comment #14)
> Alex, thanks for your help.
>
> How it gets the rom is probably the same as this shell script using the pci
> method listed on that github link in the last comment.
>
> # To read ROM you first need to write `1` to it, then read it, and then write
> # `0` to it as described in the documentation. The reason is that the content
> # is not provided by default, by writing `1` to it you are telling the driver
> # to make it accessible.
> CARD_ID=0
> CARD_ROM="/sys/class/drm/card${CARD_ID}/device/rom"
> FILE_ROM="amdgpu-rom.bin"
>
> echo 1 > $CARD_ROM
> cat $CARD_ROM > $FILE_ROM
> echo 0 > $CARD_ROM
> echo "Saved as ${FILE_ROM}"
>
> --
> output:
> cat: /sys/class/drm/card0/device/rom: Input/output error
> [Not] Saved as amdgpu-rom.bin

That should generally work for desktop discrete cards. You need to be root however.

> Are there any other user-space accessible methods to extract / write the rom
> in Linux? For now I'm only focusing on comparing the pp table to other roms,
> not modifying it.

You can read the amdgpu_vbios file in debugfs. That will dump the copy of the vbios that the driver is using.

> Maybe the powerplay problem is an atom-bios issue. If it's still broken in
> this 4.16-rc1 version I'm trying out now, I'll open a ticket.
>
> For your reference, this is the ticket that claims powerplay dpm is fixed in
> newer kernels / dc=1 in 4.15.
> https://bugs.freedesktop.org/show_bug.cgi?id=100443#c37

There's no confirmation that specifically enabling dc fixed it. Anyway, we are cluttering up this bug with potentially unrelated information. Please file a new bug for your issue and we can discuss it there.

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/152.
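The sysfs ROM script quoted in comment #14 prints "Saved as ..." even when the read fails, and leaves the ROM enabled on error. Below is a minimal sketch of the same write-1, read, write-0 sequence with error handling. The function name `dump_rom` and its arguments are hypothetical; the sequence itself is the one described in the quoted script, and it still needs root on real hardware.

```shell
#!/bin/sh
# Hypothetical helper: dump_rom ROM_PATH OUT_FILE
# Follows the sequence from the quoted script: write 1, read, write 0.
dump_rom() {
    rom=$1; out=$2
    [ -e "$rom" ] || {
        echo "no ROM file at $rom" >&2
        return 1
    }
    # Writing 1 asks the kernel to make the ROM content readable.
    echo 1 > "$rom" || {
        echo "cannot enable $rom (are you root?)" >&2
        return 1
    }
    rc=0
    if ! cat "$rom" > "$out"; then
        echo "reading $rom failed" >&2
        rm -f "$out"   # do not leave a truncated or empty dump behind
        rc=1
    fi
    # Always disable the ROM again, even if the read failed.
    echo 0 > "$rom"
    return "$rc"
}
```

For the card in this thread the call would be `dump_rom /sys/class/drm/card0/device/rom amdgpu-rom.bin && echo "Saved"`. The debugfs copy Alex mentions needs no enable step and can be read with a plain `cat` of the amdgpu_vbios file. It typically lives under /sys/kernel/debug/dri/ when debugfs is mounted at its usual location, which is an assumption to check on your system.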