| Field | Value |
|---|---|
| Summary | DMESG: [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table! |
| Product | DRI |
| Component | DRM/AMDgpu |
| Status | RESOLVED FIXED |
| Severity | normal |
| Priority | medium |
| Version | unspecified |
| Hardware | x86-64 (AMD64) |
| OS | Linux (All) |
| Reporter | Christian Lanig <freedesktop> |
| Assignee | Default DRI bug account <dri-devel> |
| CC | jan.public, j.gjorgji, koesterreich, michael, taijian, takeshi.ogasawara |

Description
Christian Lanig
2017-03-29 08:04:30 UTC
Created attachment 130551 [details]
DMESG_NEW
The I/O errors went away after I upgraded my OS. I updated the dmesg file.
Created attachment 130553 [details]
Bios values
I verified my BIOS against another one from the Internet and had a look at the powerplay tables, and I can quite likely rule out a broken BIOS as the cause.
I have looked around a bit for the cause of these messages. In the first case I think I found a reason: the comment /*Driver went offline but SMU was still alive and contains the VFT table */ at line 340 of polaris10_smumgr.c. For the second message, at line 575 of hwmgr.c, I guess the driver searches for the maximum leakage voltage, where the virtual voltage ID can have values between 0xFF01 and 0xFF08, probably representing the power states ATOM_VIRTUAL_VOLTAGE_ID0 to ATOM_VIRTUAL_VOLTAGE_ID7. It doesn't find that voltage value in the table, so instead of setting the value of the *sclk destination it prints the assertion. That's my interpretation; unfortunately the rest is too intricate for me.

The messages disappeared with an R9 290X.

The RX 480 BIOS can be downloaded here: https://www.techpowerup.com/vgabios/184548/xfx-rx480-8192-160614

Created attachment 130826 [details]
relevant dmesg output
I have a similar problem on my new notebook.
System:
Arch Linux - Kernel 4.11-rc6
Intel i7-7700HQ
AMD RX470 dGPU
dmesg, lshw, and lspci outputs are attached.
Created attachment 130827 [details]
lshw output referenced in comment 4
Created attachment 130828 [details]
lspci output referenced above
I can confirm this as well. Hardware:
Ryzen 7 1700
MSI B350 TOMAHAWK
XFX RS RX 480 4GB
Fedora 26
Kernel: 4.11.0-1.fc26.x86_64

Hi,
The same thing happens on my system.
System:
Ubuntu 17.04 + Kernel 4.12rc1
AMD Ryzen 7 1700X
ASRock X370 Taichi, BIOS v2.20
Sapphire R9 Fury 4 GB Nitro
2x16GiB
[ 1.798013] amdgpu: [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!
[ 1.805082] amdgpu: [powerplay] Failed to setup PCC HW register! Wrong GPIO assigned for VDDC_PCC_GPIO_PINID!
Once Xorg has settled, the following messages remain in kern.log:
May 16 17:03:57 anilaR9 kernel: [ 725.962865] INFO: task amdgpu_cs:0:1365 blocked for more than 120 seconds.
May 16 17:03:57 anilaR9 kernel: [ 725.962873] amdgpu_cs:0 D 0 1365 1318 0x00400000
May 16 17:03:57 anilaR9 kernel: [ 725.962916] amdgpu_ctx_add_fence+0x63/0x100 [amdgpu]
May 16 17:03:57 anilaR9 kernel: [ 725.962936] amdgpu_cs_ioctl+0x107a/0x1410 [amdgpu]
May 16 17:03:57 anilaR9 kernel: [ 725.962966] ? amdgpu_cs_find_mapping+0xa0/0xa0 [amdgpu]
May 16 17:03:57 anilaR9 kernel: [ 725.962983] amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]

Created attachment 131372 [details]
new dmesg output with kernel 4.12-rc1
After updating the kernel to 4.12-rc1, at least some of the error messages have changed:
[drm:amdgpu_suspend [amdgpu]] *ERROR* suspend of IP block <vce_v3_0> failed -110
has gone away, but there is still plenty of this going on:
[ 18.422475] amdgpu: [powerplay]
failed to send pre message 62 ret is 0
And of course this is also still a thing:
[ 3.053342] amdgpu: [powerplay] [AVFS] Something is broken. See log!
[ 3.055251] amdgpu: [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!
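
For readers following the analysis earlier in this thread: the message above is described there as the result of a failed search for a virtual voltage ID (0xFF01 to 0xFF08) in the vdd_dep_on_sclk table. The following Python sketch only illustrates that kind of lookup; it is not the driver's code. The table layout (a list of (voltage, sclk) pairs), the function names, and the sample values are assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple

# Range of virtual (leakage) voltage IDs as described in this thread.
VIRTUAL_VOLTAGE_ID_FIRST = 0xFF01
VIRTUAL_VOLTAGE_ID_LAST = 0xFF08


def is_virtual_voltage_id(value: int) -> bool:
    """Return True if value falls in the virtual voltage ID range."""
    return VIRTUAL_VOLTAGE_ID_FIRST <= value <= VIRTUAL_VOLTAGE_ID_LAST


def find_sclk_for_voltage_id(
    table: Sequence[Tuple[int, int]], voltage_id: int
) -> Optional[int]:
    """Search a hypothetical (voltage, sclk) table for voltage_id.

    Returns the matching sclk, or None when no entry matches, which is
    the situation in which the warning would be logged.
    """
    for vdd, sclk in table:
        if vdd == voltage_id:
            return sclk
    return None


# Made-up sample table: only two virtual IDs are present.
sample_table = [(0xFF01, 30000), (0xFF02, 60000)]
```

With a table that lacks the requested ID, the lookup returns None, which corresponds to the case where the driver prints the assertion instead of setting *sclk.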
Created attachment 132567 [details]
new dmesg output with kernel 4.12
Created attachment 132574 [details]
Another dmesg log with kernel 4.12
Neither "suspend of IP block <vce_v3_0> failed" nor "failed to send pre message 62 ret is 0" is displayed by dmesg with my setup.
Everything stays the same except the message "amdgpu VM size (-1) must be a power of 2" which has appeared but already been patched.
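
To check a saved dmesg log for the messages discussed in this thread, a small script along these lines may help. It is a sketch: the message list is copied from the comments above, matching is by plain substring, and the function name is hypothetical.

```python
from typing import Dict

# Messages quoted in this thread; matched as plain substrings.
KNOWN_MESSAGES = (
    "Can't find requested voltage id in vdd_dep_on_sclk table!",
    "failed to send pre message",
    "suspend of IP block <vce_v3_0> failed",
)


def count_known_messages(dmesg_text: str) -> Dict[str, int]:
    """Count how many lines of a dmesg dump contain each known message."""
    counts = {message: 0 for message in KNOWN_MESSAGES}
    for line in dmesg_text.splitlines():
        for message in KNOWN_MESSAGES:
            if message in line:
                counts[message] += 1
    return counts
```

The input can be the contents of any of the attached dmesg files read into a string; a count of zero for every entry means none of these messages appear in that log.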
(In reply to Christian Lanig from comment #11)
> Created attachment 132574 [details]
> Another dmesg log with kernel 4.12
>
> Neither "suspend of IP block <vce_v3_0> failed" nor "failed to send pre
> message 62 ret is 0" is displayed by dmesg with my setup.

The difference here is probably that your system is a desktop, whereas I have a laptop with a hybrid Intel/AMD GPU setup (HD630 + RX470). So your GPU probably does not try to power down, because it is always needed to drive your display?

No change with kernel 4.12 for me.

I have the same problem. I modded a BIOS and dumped the original beforehand. I don't know the reason, but now the kernel thinks the BIOS is broken. Maybe if you flash a non-original BIOS you are screwed forever and cannot fix it even by flashing the factory BIOS. That wouldn't surprise me, since that's the way warranty and RMA work, and maybe AMD wants to enforce this.

Created attachment 133261 [details]
dmesg kernel 4.12.4 Gigabyte AB350 motherboard Ryzen 1600
I asked Gigabyte to send me the original BIOS; maybe the BIOSes we are using were not dumped correctly. I also get a lot of "ret" messages, but different ones:
failed to send message 18a ret is 255
failed to send pre message 145 ret is 255
failed to send message 18a ret is 255
failed to send pre message 145 ret is 255
My card is stuck at 300 MHz memory speed and no matter what I do, it won't change. I even tried the ROCm stack with that custom kernel and the ROCm utility to set the memory performance level, and that wouldn't let me either. When running OpenGL the card is still downclocked and I get this message:
INVALID_DEVICE_REQUEST device=00:00.0 address=0xfffffffdf8000000 flags=0x0a00]
I also get IO_PAGE_FAULT.

OK, I looked into the BIOS with an editor and I think I found the root of the problem. My BIOS, and maybe the others as well, does not specify the voltage in the standard way (in mV) but uses some sort of magic code, an abnormally high number, to set the voltage to an unspecified value. I looked into several BIOSes and all have the same problem: the voltage they use is a very big number like 65550 mV, which in practice is impossible to set. The kernel doesn't like that and defaults to a standard value provided in the BIOS, which in my case is the lower value of 800 mV. And in my case the kernel won't use those magic values because the BIOS became unsigned when I dumped it. My options are that Gigabyte provides me with a good signed BIOS, or that I edit the BIOS manually to use normal mV values.

Same here:
[ 4.021250] amdgpu: [powerplay] amdgpu: powerplay sw initialized
[ 4.067206] amdgpu: [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!
Sapphire Nitro+ RX580 8 GB
amd-staging-4.11-1.g7262353-default
lspci -v shows this:
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/580] (rev e7) (prog-if 00 [VGA controller])
Subsystem: Sapphire Technology Limited Radeon RX 570
But it _is_ an RX580 (Polaris 20).

I've always seen
[powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!
on R9 285 Tonga, but powerplay works OK and there are no other errors - so I assumed it's harmless when seen alone.

(In reply to Andy Furniss from comment #22)
> I've always seen
>
> [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!
>
> on R9 285 Tonga, but powerplay works OK and there are no other errors - so I
> assumed it's harmless when seen alone.

Yes, good point Andy. Alex told me the same some weeks ago, if I remember right. But 'Zero-Fan-Mode' (right term?) does not work on my RX580. Maybe another bug report is needed.

In my case it makes my card completely useless: it is stuck at 300 MHz memory clock and 174 MHz GPU clock. Is there any sort of BIOS signature enforcement that makes powerplay go bananas when the BIOS is not signed?

I tried the amd-staging kernel and it boots, but it spams messages about IO ERROR and IRQ NOTHREAD like crazy, and I can't do anything except a hard reboot. I opened a new bug report: https://bugs.freedesktop.org/show_bug.cgi?id=102068

Is there any way of disabling powerplay? amdgpu.powerplay=0 in GRUB and options powerplay=0 on kernel loading don't work.

Just ignore my comments; I was setting /sys/class/drm/card0/device/pp_mclk_od incorrectly.

(In reply to Andy Furniss from comment #22)
> I've always seen
>
> [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!
>
> on R9 285 Tonga, but powerplay works OK and there are no other errors - so I
> assumed it's harmless when seen alone.

Looks like this one will be disappearing as time goes on - https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-4.14-wip&id=1ed18c2100eb471eaf6d973a0ef4421721b4cd06

Hi,
I could not identify the cause, so I tried replacing the power supply unit.
Before: SILVERSTONE SST-ST75F-P (750W)
After: Seasonic SSR-750TD (750W)
The error in the title of this ticket stopped after the exchange. I hope this is helpful.

Hi,
After upgrading from kernel 4.12.5 to 4.13.3, the problem recurred.

Created attachment 134553 [details]
relevant dmesg output with kernel 4.13.3
New dmesg output with kernel 4.13.3 AND a new acpi_osi setting after I decompiled my DSDT to see just how borked my firmware is for non-W10 OSs.
Having this same issue on kernel 4.13.5-1-ARCH.
CPU: AMD R7-1700X
Motherboard: AsRock Taichi X370
GPU: XFX RX-480 8GB
However, I have found a temporary solution that seems to allow me to actually boot and use the card. Pass the kernel parameter:
amdgpu.dpm=0
to shut off the Dynamic Power Management module, which contains AMD's powerplay, and everything works fine. I'm unsure what sort of limitations this introduces, but for the time being, it will get your system working.

Upon further investigation, I would like to amend my previous comment. Setting amdgpu.dpm=0 will allow you to bypass those errors by disabling dpm altogether, but the consequence is that the fans will stop spinning almost entirely (slow or no fan speeds when they should be ramping up). Instead, set:
amdgpu.ppfeaturemask=1
which enables the card to broadcast all of its power states properly. With this enabled, all the errors mentioned in this thread vanished from my system, and in addition, the hwmon features actually showed up under /sys/class/drm/card0/device/hwmon/hwmon0/ and I can even echo speeds to pwm1. I'm unsure if this has any adverse effects on anything, but if it does not, perhaps ppfeaturemask should be enabled by default in the amdgpu driver.

Created attachment 134863 [details]
dmesg with kernel 4.13.6
For me amdgpu.ppfeaturemask=1 does not change anything, sadly...
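
Because the workarounds above depend on which amdgpu.* options were actually passed at boot, it can be useful to check the kernel command line. The sketch below parses a command-line string such as the contents of /proc/cmdline. The function name is made up for this example, and it does not handle quoted values containing spaces.

```python
from typing import Dict


def amdgpu_params(cmdline: str) -> Dict[str, str]:
    """Extract amdgpu.<name>=<value> options from a kernel command line.

    Values are returned as strings exactly as written (e.g. "0xffffffff").
    If an option appears more than once, the last occurrence wins.
    """
    prefix = "amdgpu."
    params: Dict[str, str] = {}
    for token in cmdline.split():
        if not token.startswith(prefix) or "=" not in token:
            continue
        name, _, value = token[len(prefix):].partition("=")
        params[name] = value
    return params
```

On a running Linux system the input string can be obtained by reading /proc/cmdline. A numeric value such as 0xffffffff can be converted with int(value, 0), which accepts both decimal and 0x-prefixed forms.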
OK, so on 4.15-rc1 with amdgpu.dc=1 the error has disappeared for me. Seems to have been fixed.

Yes, with kernel 4.14 the issue doesn't appear anymore; instead everything looks fine:
amdgpu: [powerplay] amdgpu: powerplay sw initialized
I will close this. If someone has another issue, please file a separate report, or if it's not fixed for you with a current kernel, feel free to reopen it.

Hi Christian,

I hope it's okay if I re-open this bug. I still seem to be having it, almost a year later.

A while ago (not exactly sure when), ppfeaturemask stopped working for me, and now I am required to set amdgpu.dpm to 0 in order to successfully boot to a GUI; otherwise it hangs after a few drivers load. I can do most things, but performance is very poor. Setting dpm=0 and ppfeaturemask=1 or 0xffffffff has a strange effect: it boots successfully, but I can't load any sort of GUI, and it says it can't detect or connect to a display (with and without X configs).

What sort of identifier would be needed to integrate this card into the driver? And what other information would you find useful? (Commands I can run would be helpful.)

(In reply to Devin Prince from comment #39)
> Hi Christian,
>
> I hope it's okay if I re-open this bug. I still seem to be having it, almost
> a year later.
>
> A while ago (not exactly sure when), ppfeaturemask stopped working for me,
> and now I am required to set amdgpu.dpm to 0 in order to successfully boot
> to a GUI, otherwise it hangs after a few drivers load. I can do most things,
> but performance is very poor. Setting dpm=0 and ppfeaturemask=1 or
> 0xffffffff has a strange effect where it will successfully boot, but I can't
> load any sort of GUI, saying it can't detect or connect to a display (with
> and without X configs).
>
> What sort of identifier would be needed to integrate this card into the
> driver? And what other information would you find useful (commands I can run
> would be helpful)

Sounds like you may have a different issue. Please file a new bug and include your dmesg output and xorg log if you are using X.