Bug 100443

Summary:

DMESG: [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!

Product:

DRI

Reporter:

Christian Lanig <freedesktop>

Component:

DRM/AMDgpu

Assignee:

Default DRI bug account <dri-devel>

Status:

RESOLVED FIXED

QA Contact:

Severity:

normal

Priority:

medium

CC:

jan.public, j.gjorgji, koesterreich, michael, taijian, takeshi.ogasawara

Version:

unspecified

Hardware:

x86-64 (AMD64)

OS:

Linux (All)

Attachments:

Description	Flags
DMESG excerpt	none
DMESG_NEW	none
Bios values	none
relevant dmesg output	none
lshw output reference in comment 4	none
lspci output referenced above	none
new dmesg output with kernel 4.12-rc1	none
new dmesg output with kernel 4.12	none
Another dmesg log with kernel 4.12	none
dmesg kernel 4.12.4 Gigabyte AB350 motherboard Ryzen 1600	none
relevant dmesg output with kernel 4.13.3	none
dmesg with kernel 4.13.6	none

Description Christian Lanig 2017-03-29 08:04:30 UTC

Created attachment 130529 [details]
DMESG excerpt

System:
Ubuntu 17.04 + Kernel 4.11 RC4 + Padoka PPA
AMD Ryzen 7 1700X
ASRock AB350 Pro4 Bios v1.40 (because 2.20 breaks ECC)
XfX RX 480 RS
2x16GiB Crucial CT2K16G4XFD824A ECC Kit

After installing my new motherboard and CPU I went through the system messages to see how well the hardware is supported and whether there are issues to be sorted out.
Unfortunately it looks like there are issues with AMDGPU: during initialization there are numerous warnings and errors.

It starts with these, repeated several times:
[    1.482118] AMD-Vi: Event logged [
[    1.482118] IO_PAGE_FAULT device=0c:00.0 domain=0x0003 address=0x000000f4001e6a00 flags=0x0010]

And it ends by telling me that something is wrong with powerplay:
[    1.526878] amdgpu: [powerplay] [AVFS] Something is broken. See log!
[    1.528706] amdgpu: [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!

This report has been separated from this one:
https://bugs.freedesktop.org/show_bug.cgi?id=100437
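
For anyone comparing logs across kernels, the powerplay lines can be pulled out of a saved dmesg dump with a short script. The following is only a minimal sketch: the function name and the sample text are made up for illustration, and the pattern assumes the "[timestamp] amdgpu: [powerplay] message" layout seen in the excerpt above.

```python
import re

# Matches lines such as:
# [    1.528706] amdgpu: [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!
POWERPLAY_LINE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s+amdgpu: \[powerplay\]\s*(?P<msg>.*)$"
)


def powerplay_messages(dmesg_text):
    """Return (timestamp, message) pairs for every [powerplay] line."""
    found = []
    for line in dmesg_text.splitlines():
        match = POWERPLAY_LINE.match(line)
        if match:
            found.append((float(match.group("ts")), match.group("msg")))
    return found


# Sample input copied from the description; a real run would read a dmesg file.
SAMPLE = """\
[    1.482118] AMD-Vi: Event logged [
[    1.526878] amdgpu: [powerplay] [AVFS] Something is broken. See log!
[    1.528706] amdgpu: [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!
"""

for ts, msg in powerplay_messages(SAMPLE):
    print(f"{ts:.6f}  {msg}")
```

Messages that the driver splits over two lines (such as the "failed to send pre message" output reported later in this thread) would not be captured by this simple pattern.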

Comment 1 Christian Lanig 2017-03-29 20:17:34 UTC

Created attachment 130551 [details]
DMESG_NEW

The IO errors went away after I upgraded my OS. I have updated the dmesg file.

Comment 2 Christian Lanig 2017-03-29 20:48:20 UTC

Created attachment 130553 [details]
Bios values

I verified my BIOS against another copy from the Internet and had a look at the powerplay tables. I can quite likely rule out a broken BIOS as the cause.

Comment 3 Christian Lanig 2017-04-03 14:34:33 UTC

I have looked around a bit for the cause of these messages. For the first message I think I found a reason:
/*Driver went offline but SMU was still alive and contains the VFT table */
at line 340 of polaris10_smumgr.c

The second message comes from line 575 of hwmgr.c, where I guess the driver searches for the maximum leakage voltage. The virtual voltage ID can take values from 0xFF01 to 0xFF08, probably representing the power states ATOM_VIRTUAL_VOLTAGE_ID0 to ATOM_VIRTUAL_VOLTAGE_ID7.

It doesn't find the requested voltage value in the table, so instead of setting the value of the *sclk destination it prints the assertion message. That is my interpretation, at least.
Unfortunately the rest is too intricate for me.
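
To make that interpretation concrete, here is a rough sketch of the lookup as described in this comment. It is not the driver code (which is C and uses its own structures): the function name, the table layout as (voltage id, sclk) pairs, and the table values are all hypothetical.

```python
# Range of virtual voltage IDs named in this comment.
VIRTUAL_VOLTAGE_ID_FIRST = 0xFF01
VIRTUAL_VOLTAGE_ID_LAST = 0xFF08


def find_sclk_for_voltage_id(dep_table, voltage_id):
    """Return the sclk paired with voltage_id, or None if there is no match.

    dep_table is a list of (voltage_id, sclk) pairs standing in for the
    vdd_dep_on_sclk table. On a miss the warning is printed and nothing is
    returned, mirroring "prints the assertion instead of setting *sclk".
    """
    for entry_id, sclk in dep_table:
        if entry_id == voltage_id:
            return sclk
    print("Can't find requested voltage id in vdd_dep_on_sclk table!")
    return None


# Made-up table with only two of the eight possible virtual IDs present.
table = [(0xFF01, 30000), (0xFF02, 60800)]

print(find_sclk_for_voltage_id(table, 0xFF02))  # entry present
print(find_sclk_for_voltage_id(table, 0xFF05))  # entry missing, warning printed
```

Under this reading the message only means that one requested ID has no matching table entry; the sketch says nothing about whether that is harmful.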

The messages disappeared with an R9 290X.

The RX 480-BIOS can be downloaded here:
https://www.techpowerup.com/vgabios/184548/xfx-rx480-8192-160614

Comment 4 taijian 2017-04-13 10:32:17 UTC

Created attachment 130826 [details]
relevant dmesg output

I have a similar problem on my new notebook.

System:
Arch Linux - Kernel 4.11-rc6
Intel i7-7700HQ
AMD RX470 dGPU

dmesg, lshw, and lspci outputs are attached.

Comment 5 taijian 2017-04-13 10:32:58 UTC

Created attachment 130827 [details]
lshw output reference in comment 4

Comment 6 taijian 2017-04-13 10:33:36 UTC

Created attachment 130828 [details]
lspci output referenced above

Comment 7 Gjorgji Jankovski 2017-05-07 06:17:45 UTC

I can confirm this as well. Hardware:


Ryzen 7 1700
MSI B350 TOMAHAWK
Xfx RS RX 480 4GB

Fedora 26
Kernel: 4.11.0-1.fc26.x86_64

Comment 8 takeshi ogasawara 2017-05-16 08:37:58 UTC

Hi 

The same thing happens to my system

System:
Ubuntu 17.04 + Kernel 4.12rc1
AMD Ryzen 7 1700X
ASRock X370 taichi Bios v2.20
Sapphire R9 Fury 4 GB Nitro
2x16GiB

[    1.798013] amdgpu: [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!
[    1.805082] amdgpu: [powerplay] Failed to setup PCC HW register! Wrong GPIO assigned for VDDC_PCC_GPIO_PINID!


Once Xorg has settled, the following messages remain in kern.log:

May 16 17:03:57 anilaR9 kernel: [  725.962865] INFO: task amdgpu_cs:0:1365 blocked for more than 120 seconds.
May 16 17:03:57 anilaR9 kernel: [  725.962873] amdgpu_cs:0     D    0  1365   1318 0x00400000
May 16 17:03:57 anilaR9 kernel: [  725.962916]  amdgpu_ctx_add_fence+0x63/0x100 [amdgpu]
May 16 17:03:57 anilaR9 kernel: [  725.962936]  amdgpu_cs_ioctl+0x107a/0x1410 [amdgpu]
May 16 17:03:57 anilaR9 kernel: [  725.962966]  ? amdgpu_cs_find_mapping+0xa0/0xa0 [amdgpu]
May 16 17:03:57 anilaR9 kernel: [  725.962983]  amdgpu_drm_ioctl+0x4f/0x90 [amdgpu]

Comment 9 taijian 2017-05-16 10:08:47 UTC

Created attachment 131372 [details]
new dmesg output with kernel 4.12-rc1

After updating the kernel to 4.12-rc1, at least some of the error messages have changed:

[drm:amdgpu_suspend [amdgpu]] *ERROR* suspend of IP block <vce_v3_0> failed -110

has gone away, but there is still plenty of this going on:

[   18.422475] amdgpu: [powerplay] 
                failed to send pre message 62 ret is 0 

And of course this is also still a thing:

[    3.053342] amdgpu: [powerplay] [AVFS] Something is broken. See log!
[    3.055251] amdgpu: [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!

Comment 10 taijian 2017-07-09 09:56:47 UTC

Created attachment 132567 [details]
new dmesg output with kernel 4.12

Comment 11 Christian Lanig 2017-07-09 12:41:17 UTC

Created attachment 132574 [details]
Another dmesg log with kernel 4.12

Neither "suspend of IP block <vce_v3_0> failed" nor "failed to send pre message 62 ret is 0" is displayed by dmesg with my setup.

Everything else stays the same, except that the message "amdgpu VM size (-1) must be a power of 2" has appeared; it has already been patched.

Comment 12 taijian 2017-07-09 13:03:34 UTC

(In reply to Christian Lanig from comment #11)
> Created attachment 132574 [details]
> Another dmesg log with kernel 4.12
> 
> Neither "suspend of IP block <vce_v3_0> failed" nor "failed to send pre
> message 62 ret is 0" is displayed by dmesg with my setup.

The difference here is probably that your system is a desktop, whereas I have a laptop with a hybrid Intel/AMD GPU setup (HD630 + RX470). So your GPU probably does not try to power down, because it is always needed to drive your display?

Comment 13 Gjorgji Jankovski 2017-07-10 19:02:51 UTC

No change with kernel 4.12 for me.

Comment 14 alvarex 2017-08-05 22:27:09 UTC

I have the same problem. I modded a BIOS, having dumped the original beforehand. I don't know the reason, but the kernel now thinks the BIOS is broken. Maybe once you flash a non-original BIOS you are screwed forever and cannot fix it even by flashing the factory BIOS. That wouldn't surprise me, since that's the way warranty and RMA work, and maybe AMD wants to enforce this.

Comment 15 alvarex 2017-08-05 22:28:14 UTC

Created attachment 133261 [details]
dmesg kernel 4.12.4 Gigabyte AB350 motherboard Ryzen 1600

Comment 16 alvarex 2017-08-05 23:08:28 UTC

I asked Gigabyte to send me the original BIOS; maybe the BIOSes we are using were not dumped correctly.

Comment 17 alvarex 2017-08-05 23:16:19 UTC

I also get a lot of "ret" messages, but different ones:
                
                failed to send message 18a ret is 255 
                failed to send pre message 145 ret is 255 
                failed to send message 18a ret is 255 
                failed to send pre message 145 ret is 255 

My card is stuck at a 300 MHz memory clock and nothing I do changes it. I even tried the ROCm stack with its custom kernel and used the ROCm utility to set the memory performance level, and that wouldn't let me either.

Comment 18 alvarex 2017-08-06 15:08:42 UTC

When running OpenGL the card is still downclocked, and I get this message:

INVALID_DEVICE_REQUEST device=00:00.0 address=0xfffffffdf8000000 flags=0x0a00]

Comment 19 alvarex 2017-08-06 15:12:01 UTC

I also get IO_PAGE_FAULT

Comment 20 alvarex 2017-08-06 21:08:55 UTC

OK, I looked into the BIOS with an editor and I think I found the root of the problem. My BIOS, and maybe the others as well, does not list the voltages in the standard way (in mV); instead it uses an abnormally high number, some sort of magic code, that leaves the actual voltage unspecified. I looked into several BIOSes and all have the same problem: the voltage they use is a very big number like 65550 mV, which in practice is impossible to set. The kernel doesn't like that and defaults to a standard value provided in the BIOS, which in my case is the lower value of 800 mV. And the kernel in my case won't use those magic values because the BIOS became unsigned when I dumped it. My options are that Gigabyte provides me with a good signed BIOS, or that I edit the BIOS manually to use normal mV values.
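
One possible reading, offered as an assumption and not a confirmed diagnosis: the very large numbers shown by a BIOS editor may be the virtual voltage IDs mentioned in comment 3 (0xFF01 to 0xFF08, which is 65281 to 65288 in decimal) and not literal millivolt values. The sketch below shows how such a field could be told apart from a plain mV value; the function name and the output wording are made up for illustration.

```python
# Virtual voltage ID range as given in comment 3.
VIRTUAL_VOLTAGE_ID_FIRST = 0xFF01  # 65281 in decimal
VIRTUAL_VOLTAGE_ID_LAST = 0xFF08   # 65288 in decimal


def describe_voltage_field(raw):
    """Label a raw table value as a virtual voltage ID or a plain mV value."""
    if VIRTUAL_VOLTAGE_ID_FIRST <= raw <= VIRTUAL_VOLTAGE_ID_LAST:
        index = raw - VIRTUAL_VOLTAGE_ID_FIRST
        return f"virtual voltage ID {index} (0x{raw:04X})"
    return f"{raw} mV"


for raw in (800, 0xFF01, 0xFF08):
    print(raw, "->", describe_voltage_field(raw))
```

If this reading is right, rewriting those entries as ordinary mV values would replace an ID that is meant to be resolved at run time with a fixed voltage, so it should be treated with caution.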

Comment 21 Dieter Nützel 2017-08-06 21:29:57 UTC

Same here:

[    4.021250] amdgpu: [powerplay] amdgpu: powerplay sw initialized
[    4.067206] amdgpu: [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!

Sapphire Nitro+ RX580 8 GB
amd-staging-4.11-1.g7262353-default

lspci -v shows this:
01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Ellesmere [Radeon RX 470/480/570/580] (rev e7) (prog-if 00 [VGA controller])
        Subsystem: Sapphire Technology Limited Radeon RX 570

But it _is_ RX580 (Polaris 20).

Comment 22 Andy Furniss 2017-08-06 22:26:21 UTC

I've always seen 

[powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!

on R9 285 Tonga, but powerplay works OK and there are no other errors - so I assumed it's harmless when seen alone.

Comment 23 Dieter Nützel 2017-08-06 23:10:42 UTC

(In reply to Andy Furniss from comment #22)
> I've always seen 
> 
> [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!
> 
> on R9 285 Tonga, but powerplay works OK and there are no other errors - so I
> assumed it's harmless when seen alone.

Yes, good point Andy.
Alex told me the same some weeks ago, if I remember right.
But 'Zero-Fan-Mode' (right term?) does not work on my RX580.
Maybe another bug report is needed.

Comment 24 alvarex 2017-08-07 11:08:48 UTC

In my case it makes my card completely useless: it is stuck at a 300 MHz memory clock and a 174 MHz GPU clock.

Comment 25 alvarex 2017-08-07 11:14:19 UTC

Is there any sort of BIOS signature enforcement that makes powerplay go bananas when the BIOS is not signed?

Comment 26 alvarex 2017-08-07 11:22:55 UTC

I tried the amd-staging kernel and it boots, but it spams messages about IO ERROR and IRQ NOTHREAD like crazy, and I can't do anything except a hard reboot.

Comment 27 alvarex 2017-08-07 11:43:30 UTC

I opened a new bug report https://bugs.freedesktop.org/show_bug.cgi?id=102068

Comment 28 alvarex 2017-08-07 12:06:39 UTC

Is there any way of disabling powerplay? Neither amdgpu.powerplay=0 in GRUB nor options powerplay=0 at kernel load time works.

Comment 29 alvarex 2017-08-07 15:22:24 UTC

Just ignore my comments; I was setting /sys/class/drm/card0/device/pp_mclk_od incorrectly.

Comment 30 Andy Furniss 2017-08-11 15:12:02 UTC

(In reply to Andy Furniss from comment #22)
> I've always seen 
> 
> [powerplay] Can't find requested voltage id in vdd_dep_on_sclk table!
> 
> on R9 285 Tonga, but powerplay works OK and there are no other errors - so I
> assumed it's harmless when seen alone.

Looks like this one will be disappearing as time goes on -

https://cgit.freedesktop.org/~agd5f/linux/commit/?h=drm-next-4.14-wip&id=1ed18c2100eb471eaf6d973a0ef4421721b4cd06

Comment 31 takeshi ogasawara 2017-09-16 00:22:13 UTC

Hi

I could not identify the cause, so I tried replacing the power supply unit.

Before: SILVERSTONE SST-ST75F-P (750W)
After: Seasonic SSR-750TD (750W)

The error in the title of this ticket stopped after the exchange.
I hope this is helpful.

Comment 32 takeshi ogasawara 2017-09-26 08:58:05 UTC

Hi

After upgrading from kernel 4.12.5 to 4.13.3, the error recurred.

Comment 33 taijian 2017-09-28 18:01:38 UTC

Created attachment 134553 [details]
relevant dmesg output with kernel 4.13.3

New dmesg output with kernel 4.13.3 AND a new acpi_osi setting after I decompiled my DSDT to see just how borked my firmware is for non-W10 OSs.

Comment 34 Devin Prince 2017-10-13 07:00:28 UTC

Having this same issue on kernel 4.13.5-1-ARCH.

CPU: AMD R7-1700X
Motherboard: AsRock Taichi X370
GPU: XFX RX-480 8GB

I have found a temporary solution that seems to allow me to actually boot and use the card, however.

Pass the kernel parameter:
amdgpu.dpm=0
to shut off the Dynamic Power Management module, which contains AMD's powerplay, and everything works fine.

I'm unsure what sort of limitations this introduces, but for the time being, it will get your system working.

Comment 35 Devin Prince 2017-10-13 11:49:51 UTC

Upon further investigation, I would like to amend my previous comment.

Setting amdgpu.dpm=0 will allow you to bypass those errors by disabling dpm altogether, but the consequence is that the fans will stop spinning almost entirely (slow or no fan speeds when they should be ramping up).

Instead, set:

    amdgpu.ppfeaturemask=1

which enables the card to broadcast all of its power states properly. With this enabled, all the errors mentioned in this thread vanished from my system, and in addition, the hwmon features actually showed up under /sys/class/drm/card0/device/hwmon/hwmon0/ and I can even echo speeds to pwm1.

I'm unsure if this has any adverse effects on anything, but if it does not, perhaps ppfeaturemask should be enabled by default in the amdgpu driver.
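
When testing options such as amdgpu.dpm=0 or amdgpu.ppfeaturemask=1, it can help to confirm which amdgpu parameters actually reached the kernel command line. The following is a minimal sketch: the function name and the example command line are made up, and it only covers parameters passed on the kernel command line, not options set through modprobe configuration.

```python
def amdgpu_params(cmdline):
    """Collect amdgpu.<name>=<value> options from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        if token.startswith("amdgpu.") and "=" in token:
            key, value = token[len("amdgpu."):].split("=", 1)
            params[key] = value
    return params


# Illustrative command line; on a running system the real one is in /proc/cmdline.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 quiet amdgpu.ppfeaturemask=1"
print(amdgpu_params(example))

# On a live system:
#     with open("/proc/cmdline") as f:
#         print(amdgpu_params(f.read()))
```

Values are kept as strings on purpose, since masks are sometimes given in hexadecimal (for example 0xffffffff).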

Comment 36 taijian 2017-10-16 14:05:13 UTC

Created attachment 134863 [details]
dmesg with kernel 4.13.6

For me amdgpu.ppfeaturemask=1 does not change anything, sadly...

Comment 37 taijian 2017-11-28 20:37:25 UTC

OK, so on 4.15-rc1 with amdgpu.dc=1 the error has disappeared for me. Seems to have been fixed.

Comment 38 Christian Lanig 2017-12-15 08:08:00 UTC

Yes, with kernel 4.14 the issue doesn't appear anymore; instead everything looks fine:
amdgpu: [powerplay] amdgpu: powerplay sw initialized

I will close this. If someone has another issue, please file a separate report; if it is not fixed for you with a current kernel, feel free to reopen it.

Comment 39 Devin Prince 2018-09-20 04:17:11 UTC

Hi Christian,

I hope it's okay if I re-open this bug. I still seem to be having it, almost a year later.

A while ago (not exactly sure when), ppfeaturemask stopped working for me, and now I am required to set amdgpu.dpm to 0 in order to successfully boot to a GUI, otherwise it hangs after a few drivers load. I can do most things, but performance is very poor. Setting dpm=0 and ppfeaturemask=1 or 0xffffffff has a strange effect where it will successfully boot, but I can't load any sort of GUI, saying it can't detect or connect to a display (with and without X configs).

What sort of identifier would be needed to integrate this card into the driver? And what other information would you find useful (commands I can run would be helpful)

Comment 40 Alex Deucher 2018-09-20 04:27:40 UTC

(In reply to Devin Prince from comment #39)
> Hi Christian,
> 
> I hope it's okay if I re-open this bug. I still seem to be having it, almost
> a year later.
> 
> A while ago (not exactly sure when), ppfeaturemask stopped working for me,
> and now I am required to set amdgpu.dpm to 0 in order to successfully boot
> to a GUI, otherwise it hangs after a few drivers load. I can do most things,
> but performance is very poor. Setting dpm=0 and ppfeaturemask=1 or
> 0xffffffff has a strange effect where it will successfully boot, but I can't
> load any sort of GUI, saying it can't detect or connect to a display (with
> and without X configs).
> 
> What sort of identifier would be needed to integrate this card into the
> driver? And what other information would you find useful (commands I can run
> would be helpful)

Sounds like you may have a different issue.  Please file a new bug and include your dmesg output and xorg log if you are using X.
