Bug 110777 - Kernel 5.1-5.2 MCLK stuck at 167MHz Vega 10 (56)
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium blocker
Assignee: Alex Deucher
Reported: 2019-05-27 17:56 UTC by Anton Herzfeld
Modified: 2019-09-15 19:56 UTC

Attachments
Tweaked PowerPlay table (694 bytes, application/octet-stream)
2019-09-13 13:51 UTC, Térence Clastres

Description Anton Herzfeld 2019-05-27 17:56:20 UTC
Hi,

Since kernel 5.1, and still on kernel 5.2, my Vega 56 card's memory clock is stuck at 167MHz and no longer boosts.
The exact same setup boosts fine to 1000MHz memclk when running kernel 5.0.13.

Is there any info I can provide to get this fixed?
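
One quick way to confirm the stuck state is to read which memory clock level the driver marks as active. This is a minimal sketch, assuming the GPU is card0 and that the amdgpu `pp_dpm_mclk` file lists one level per line with `*` on the active one; the function name `active_mclk` is made up for illustration.

```shell
# Print the memory clock level currently marked active ('*') by amdgpu.
# The default path assumes the GPU is card0; pass another file to override.
active_mclk() {
    local file="${1:-/sys/class/drm/card0/device/pp_dpm_mclk}"
    # Lines look like "0: 167Mhz *"; the asterisk flags the active level.
    awk '/\*/ { print $2; exit }' "$file"
}
```

Calling `active_mclk` under load on an affected kernel should keep printing the lowest level if the clock really is stuck.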
Comment 1 Anton Herzfeld 2019-06-02 17:40:02 UTC
This is still occurring on the latest Linux master, cd6c84d8f0cdc911df435bb075ba22ce3c605b07.
Comment 2 Anton Herzfeld 2019-06-03 18:41:33 UTC
The issue is fully fixed on kernel master (currently I am using commit 460b48a0fefce25beb0fc0139e721c5691d65d7f) when reverting drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c back to the state it was around kernel 5.0.13.

https://git.archlinux.org/linux.git/tree/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c?h=v5.0.13-arch1

I will start bisecting soon to figure out the exact commit that has caused the issue.
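
Since reverting a single file fixes the problem, the bisect can be limited to commits touching that file. The sketch below only prints the commands so they can be reviewed first; it assumes a mainline tree in which the tags v5.0 (good) and v5.1 (bad) exist, and the helper name `bisect_plan` is made up for illustration.

```shell
# Print the commands for a path-limited bisect; review them before running.
# $1: known-good tag, $2: known-bad tag (both assumed to exist in the tree)
bisect_plan() {
    local good="${1:-v5.0}" bad="${2:-v5.1}"
    local path=drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c
    # git bisect start takes the bad revision first, then the good one.
    echo "git bisect start $bad $good -- $path"
    echo "# build and boot each candidate, then mark it with one of:"
    echo "git bisect good   # MCLK boosts normally"
    echo "git bisect bad    # MCLK stays at 167MHz"
}
```

The printed commands are meant to be run by hand one at a time inside the kernel tree, marking each booted candidate as good or bad.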
Comment 3 Anton Herzfeld 2019-06-03 18:55:18 UTC
Reverting the following two patches fixes the boost in memory clocks, but it seems that once the memory clock has ramped up it does not go down again.

1.
Revert "drm/amd/powerplay: update soc boot and max level on vega10"   
This reverts commit 373e87fc91527124cb8ec21465a6d070a65c56af.

2.
Revert "drm/amd/powerplay: support Vega10 SOCclk and DCEFclk dpm level settings"
This reverts commit bb05821b13fa0c0b97760cb292b30d3105d65954.

Evan Quan <evan.quan@amd.com>
Alex Deucher <alexander.deucher@amd.com>
Comment 4 Anton Herzfeld 2019-06-03 19:21:53 UTC
Is there anything else I can provide to support getting this fixed?
Comment 5 Anton Herzfeld 2019-06-05 17:44:58 UTC
The following patch fixes the issue with boosting again:

https://cgit.freedesktop.org/~agd5f/linux/commit/drivers/gpu/drm/amd/powerplay/hwmgr?h=amd-staging-drm-next&id=7d59c41b5150d0641203f91cfcaa0f9af5999cce

However, it also seems to expose the issue of MCLK not going down again once it has boosted.

Just to clarify: the issue occurs when using manual OD on MCLK, since kernel 5.1.
Comment 6 Anton Herzfeld 2019-06-13 16:52:07 UTC
@Alex Deucher is there any chance we can get a backport of https://cgit.freedesktop.org/~agd5f/linux/commit/drivers/gpu/drm/amd/powerplay/hwmgr?h=amd-staging-drm-next&id=7d59c41b5150d0641203f91cfcaa0f9af5999cce

into the 5.1 Kernel? This Kernel is broken for Vega 10 otherwise (Kernel 5.2 is also still broken).
Comment 7 haro41 2019-07-11 15:54:02 UTC
I can confirm this issue with kernel 5.2.
HBM2 clocks stay at 167MHz if I try to overclock memory by writing to:
/sys/class/drm/card0/device/pp_od_clk_voltage or
/sys/class/drm/card0/device/pp_table

Below is the output of a monitoring script:

gpu_vdd: 1100 gpu_clk: 1638000000 mem_clk: 167000000 temp: 40000 fan: 1625 pwm: 109 pow: 165000000
gpu_vdd: 1100 gpu_clk: 1623000000 mem_clk: 167000000 temp: 39000 fan: 1608 pwm: 109 pow: 162000000
gpu_vdd: 1100 gpu_clk: 1637000000 mem_clk: 167000000 temp: 41000 fan: 1603 pwm: 109 pow: 161000000
gpu_vdd: 1100 gpu_clk: 1639000000 mem_clk: 167000000 temp: 41000 fan: 1596 pwm: 109 pow: 160000000
gpu_vdd: 1100 gpu_clk: 1640000000 mem_clk: 167000000 temp: 41000 fan: 1618 pwm: 109 pow: 157000000
gpu_vdd: 1100 gpu_clk: 1640000000 mem_clk: 167000000 temp: 41000 fan: 1610 pwm: 109 pow: 159000000
gpu_vdd: 1100 gpu_clk: 1639000000 mem_clk: 167000000 temp: 42000 fan: 1603 pwm: 109 pow: 159000000
gpu_vdd: 1100 gpu_clk: 1638000000 mem_clk: 167000000 temp: 40000 fan: 1601 pwm: 109 pow: 162000000
gpu_vdd: 1100 gpu_clk: 1638000000 mem_clk: 167000000 temp: 42000 fan: 1603 pwm: 109 pow: 162000000
gpu_vdd: 1100 gpu_clk: 1638000000 mem_clk: 167000000 temp: 41000 fan: 1596 pwm: 109 pow: 161000000
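
The monitoring script itself is not shown in this report. A rough equivalent that produces lines in the same format could look like the sketch below, assuming the values come from the amdgpu hwmon attributes (`in0_input` in mV, `freq1_input` and `freq2_input` in Hz, `temp1_input` in millidegrees, `fan1_input`, `pwm1`, `power1_average` in microwatts). The function name `gpu_sample` is made up, and the hwmon directory index differs between systems.

```shell
# Print one sample line in the same format as the output above.
# $1 is the hwmon directory of the GPU; the hwmonN index varies per system.
gpu_sample() {
    local h="$1"
    printf 'gpu_vdd: %s gpu_clk: %s mem_clk: %s temp: %s fan: %s pwm: %s pow: %s\n' \
        "$(cat "$h/in0_input")" \
        "$(cat "$h/freq1_input")" \
        "$(cat "$h/freq2_input")" \
        "$(cat "$h/temp1_input")" \
        "$(cat "$h/fan1_input")" \
        "$(cat "$h/pwm1")" \
        "$(cat "$h/power1_average")"
}
```

For example, `while sleep 1; do gpu_sample /sys/class/drm/card0/device/hwmon/hwmon0; done` would print one line per second, assuming the GPU's hwmon directory is hwmon0.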
Comment 8 velemas 2019-07-13 21:37:58 UTC
I confirm I have the same issue on my Acer Predator Helios 500 with Vega 56.
Comment 9 velemas 2019-07-22 19:55:49 UTC
Sometimes MCLK gets stuck at 500MHz and SCLK at 879MHz. With these clocks, after some time under load, my laptop plays the notification sound as if the power cord had been disconnected and the power LED switches off; then the screen looks like a blank TV screen with white noise, and I have to reboot.

Still reproducible on 5.2.2-arch1.
Comment 10 Térence Clastres 2019-08-24 00:54:54 UTC
Same behaviour using pptables: memory either gets stuck at 167MHz (level 0) or at 800MHz (level 2), on 5.2 with https://aur.archlinux.org/packages/amdgpu-dkms/, which from what I understand should pull the latest changes to amdgpu.

If using the classic `echo "m 3 200 1050" | sudo tee /sys/class/drm/card0/device/pp_od_clk_voltage`, I found myself able to set an absurd memory frequency like 1400MHz, which my various CLI tools report as being in use, but it doesn't look like it does anything.
Comment 11 velemas 2019-08-25 15:44:00 UTC
(In reply to Térence Clastres from comment #10)
> Same behaviour using pptables: memory either get stuck to 167MHz (level 0)
> or 800MHz (level 2), on 5.2 with
> https://aur.archlinux.org/packages/amdgpu-dkms/ which from what I understand
> should pull latest changes to amdgpu.
> 
> If using the classic `echo "m 3 200 1050" | sudo tee 
> /sys/class/drm/card0/device/pp_od_clk_voltage` I found myself able to set an
> absurd memory frequency like 1400MHz which is reported to be used on my
> different cli tools, but it doesn't look like it does anything.

Yes, I also sometimes observe unreal frequencies like 2131MHz. But I've noticed that when I plug in my laptop's power cord after the kernel is booted in the bootloader, MCLK is set at the 500MHz level and SCLK at 879MHz, which is enough for all my games. If a game is more demanding, the whole system may fail with the TV static effect, but this can be worked around by writing "manual" to /sys/class/drm/card0/device/power_dpm_force_performance_level (or using corectrl https://aur.archlinux.org/packages/corectrl/) and setting level 5, which is 847MHz, in /sys/class/drm/card0/device/pp_dpm_socclk. With that setting my system is stable.

TV static is also observable on resume when using suspend2ram.
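
The workaround described above can be sketched as a small helper. It assumes the GPU is card0 and that both sysfs files accept plain writes as root; the function name `pin_socclk` is made up for illustration.

```shell
# Switch DPM to manual mode and pin the SOC clock to one level.
# $1: SOC clock level index (5 in the comment above)
# $2: device directory (defaults to card0; writing there needs root)
pin_socclk() {
    local level="$1"
    local dev="${2:-/sys/class/drm/card0/device}"
    # Manual mode must be selected before a specific level can be forced.
    echo manual > "$dev/power_dpm_force_performance_level" || return 1
    echo "$level" > "$dev/pp_dpm_socclk"
}
```

Running `pin_socclk 5` as root applies the setting; writing `auto` to `power_dpm_force_performance_level` hands control back to the driver.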
Comment 12 Térence Clastres 2019-08-25 18:10:23 UTC
(In reply to velemas from comment #11)

I don't know if it changes anything but I'm on a desktop system.
Comment 13 Térence Clastres 2019-09-03 10:09:12 UTC
I can still reproduce with linux 5.3-rc7: Setting the memclk to anything higher than 950MHz with a powertable makes it stuck at 800MHz.
Comment 14 haro41 2019-09-03 18:26:24 UTC
(In reply to Térence Clastres from comment #13)
> I can still reproduce with linux 5.3-rc7: Setting the memclk to anything
> higher than 950MHz with a powertable makes it stuck at 800MHz.

I had the same issue (5.3-rc3).

Since I changed the value 'ucSocClockIndexHigh' in 'state 1' from 5 to 7 (96000 -> 110700), I can run MCLK up to 1100MHz without problems.
Comment 15 Térence Clastres 2019-09-03 19:50:22 UTC
(In reply to haro41 from comment #14)
> (In reply to Térence Clastres from comment #13)
> > I can still reproduce with linux 5.3-rc7: Setting the memclk to anything
> > higher than 950MHz with a powertable makes it stuck at 800MHz.
> 
> I had the same issue (5.3-rc3).
> 
> Since i changed the value 'ucSocClockIndexHigh' in 'state 1', from 5->7
> (96000->110700) i can run mclk up to 1100MHz, without problems.

I can't figure out where ucSocClockIndexHigh is in the ppt.
Comment 16 Térence Clastres 2019-09-03 20:25:48 UTC
(In reply to Térence Clastres from comment #15)

Found it and it works! However, after reaching 1095MHz, it falls back to 800MHz but doesn't go below that (to 167MHz or 500MHz).
Comment 17 haro41 2019-09-06 10:58:25 UTC
(In reply to Térence Clastres from comment #16)

I remember I adapted my UV values like this:
(cat /sys/class/drm/card0/device/pp_od_clk_voltage)
OD_SCLK:
0:        852Mhz        800mV
1:        991Mhz        850mV
2:       1084Mhz        900mV
3:       1138Mhz        910mV
4:       1200Mhz        920mV
5:       1401Mhz        940mV
6:       1536Mhz        950mV
7:       1630Mhz       1100mV
OD_MCLK:
0:        167Mhz        800mV
1:        500Mhz        800mV
2:        800Mhz        900mV
3:       1100Mhz        940mV
OD_RANGE:
SCLK:     852MHz       2400MHz
MCLK:     167MHz       1500MHz
VDDC:     800mV        1200mV

Maybe this works for you too.
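
Before writing a new memory clock it can help to check it against the limits the driver reports. The sketch below parses the MCLK row of the OD_RANGE section shown above; it assumes the file layout matches that listing, and the function name `mclk_in_range` is made up for illustration.

```shell
# Succeed if a memory clock (MHz) lies inside the MCLK row of OD_RANGE.
# $1: requested clock in MHz, $2: pp_od_clk_voltage file (default: card0)
mclk_in_range() {
    local mhz="$1"
    local file="${2:-/sys/class/drm/card0/device/pp_od_clk_voltage}"
    awk -v want="$mhz" '
        /^OD_RANGE:/ { in_range = 1; next }
        in_range && $1 == "MCLK:" {
            lo = $2 + 0; hi = $3 + 0      # "167MHz" + 0 evaluates to 167
            found = (want >= lo && want <= hi)
        }
        END { exit (found ? 0 : 1) }
    ' "$file"
}
```

With the table above, `mclk_in_range 1100` succeeds and `mclk_in_range 1600` fails, because the reported MCLK range is 167MHz to 1500MHz. Being inside the range does not mean the card is stable at that clock.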
Comment 18 Térence Clastres 2019-09-06 11:13:56 UTC
(In reply to haro41 from comment #17)

Thanks, I use very similar values. I thought adjusting OD_MCLK voltages would only set the core voltage floor, but I'm not sure what this means in practice.
Comment 19 haro41 2019-09-06 11:40:46 UTC
> Thanks, I share very similar values. I thought adjusting OD_MCLK voltages
> would only set core voltage floor, but I'm not sure what this mean in
> practice.

Yes, the OD_MCLK voltage values are (somewhat misleadingly) actually core voltages, linked by indices in the MCLK table.
Comment 20 Térence Clastres 2019-09-06 11:42:28 UTC
(In reply to haro41 from comment #19)
> > Thanks, I share very similar values. I thought adjusting OD_MCLK voltages
> > would only set core voltage floor, but I'm not sure what this mean in
> > practice.
> 
> Yes, the OD_MCLK voltage values are (somehow missleading) actually core
> voltages linked by indices in MCLK table.

So why change the default values?
Comment 21 haro41 2019-09-06 12:38:39 UTC
(In reply to Térence Clastres from comment #20)
> (In reply to haro41 from comment #19)
> > > Thanks, I share very similar values. I thought adjusting OD_MCLK voltages
> > > would only set core voltage floor, but I'm not sure what this mean in
> > > practice.
> > 
> > Yes, the OD_MCLK voltage values are (somehow missleading) actually core
> > voltages linked by indices in MCLK table.
> 
> So why change the default values?

The memory is always clocked depending on the current performance level, 0-7 (GFXCLK). The indices (ucVddInd) in the powerplay table tell the driver/SMU which memory clock has to be used for a specific performance level.

So all you can adjust (with respect to memory clock) is this relation and the memory clocks themselves.

By default, for the air-cooled RX Vega 64, MCLK 2 is used beginning with performance level 2 (ucVddInd = 2) and MCLK 3 is used beginning with level 5 (ucVddInd = 5).
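
The relation described above can be modelled as "pick the highest MCLK state whose ucVddInd does not exceed the current performance level". The sketch below is only an illustration of that reading, not driver code; the function name `mclk_state_for_level` is made up, and in the usage example the ucVddInd values 0 and 1 for MCLK 0 and MCLK 1 are placeholders, since only the values for MCLK 2 and MCLK 3 are given here.

```shell
# Print the highest MCLK state whose ucVddInd is <= the performance level.
# $1: performance level (0-7); $2..: ucVddInd of MCLK 0, 1, 2, ... in order
mclk_state_for_level() {
    local level="$1" state=0 idx=0 ind
    shift
    for ind in "$@"; do
        # Each later state that is already allowed at this level wins.
        [ "$ind" -le "$level" ] && state=$idx
        idx=$((idx + 1))
    done
    echo "$state"
}
```

With the placeholder indices, `mclk_state_for_level 4 0 1 2 5` prints 2 and `mclk_state_for_level 5 0 1 2 5` prints 3, matching the default behaviour described for MCLK 2 and MCLK 3.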

The SOC clock must always be above the current MCLK value, and this is where core clock voltage matters! Hence you can't undervolt too much if you are using higher MCLK (and SOC clock) values.
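
As a worked check of that rule against the numbers from earlier in this thread: the SOC clock entries 96000 and 110700 read as 960MHz and 1107MHz if the table stores clocks in 10 kHz units, which is an assumption based on those values. The helper names below are made up for illustration.

```shell
# Convert a PowerPlay clock entry (assumed to be in 10 kHz units) to MHz.
pp_to_mhz() {
    echo $(( $1 / 100 ))
}

# Succeed if the SOC clock entry is strictly above the memory clock.
# $1: SOC clock table entry (10 kHz units), $2: memory clock in MHz
soc_covers_mclk() {
    local soc_mhz
    soc_mhz=$(pp_to_mhz "$1")
    [ "$soc_mhz" -gt "$2" ]
}
```

Here `soc_covers_mclk 96000 1100` fails while `soc_covers_mclk 110700 1100` succeeds, which is consistent with 1100MHz memory only working after ucSocClockIndexHigh was raised from 5 to 7.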
Comment 22 Térence Clastres 2019-09-08 15:40:37 UTC
(In reply to haro41 from comment #21)

Got it, thank you very much.
Comment 23 Térence Clastres 2019-09-08 15:58:23 UTC
It still doesn't explain why, after level 7 has been reached, MEMCLK doesn't go lower than state 2 when back at level 0, with ucVddInd set to 0.
Comment 24 haro41 2019-09-13 13:08:32 UTC
Indeed, that seems buggy.
Just for verification:
Can you attach your modified powerplay table (as a binary file)?
Comment 25 Térence Clastres 2019-09-13 13:51:54 UTC
Created attachment 145347 [details]
Tweaked PowerPlay table

Sure, here it is.
Comment 26 haro41 2019-09-13 16:18:20 UTC
I can't reproduce this issue with your pp-table loaded. Works as expected here.
I tried with 5.3.0-rc8 and with head of 'drm-next' branch from '~agd5f/linux'.

Are your amdgpu firmware files up to date?
Comment 27 Térence Clastres 2019-09-14 14:58:53 UTC
(In reply to haro41 from comment #26)
> I can't reproduce this issue with your pp-table loaded. Works as expected
> here.
> I tried with 5.3.0-rc8 and with head of 'drm-next' branch from
> '~agd5f/linux'.
> 
> Are your amdgpu firmware files up todate?

How do I check that? 
I should add that I found that it does indeed sometimes go back to 167MHz, but not all the time.
Comment 28 haro41 2019-09-15 19:44:00 UTC
(In reply to Térence Clastres from comment #27)
> (In reply to haro41 from comment #26)
> > I can't reproduce this issue with your pp-table loaded. Works as expected
> > here.
> > I tried with 5.3.0-rc8 and with head of 'drm-next' branch from
> > '~agd5f/linux'.
> > 
> > Are your amdgpu firmware files up todate?
> 
> How do I check that? 
> I should add that I found that it does indeed sometimes go back to 167MHz,
> but not all the time.

Firmware files for amdgpu are usually in '/lib/firmware/amdgpu', and the 'vega10*' prefixed files are the ones related to Vega 56/64.

Here you can download amdgpu firmware files:
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu
Comment 29 Térence Clastres 2019-09-15 19:56:56 UTC
(In reply to haro41 from comment #28)

Thanks, comparing md5sum shows they are up-to-date.
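
For anyone wanting to repeat the comparison, a minimal sketch follows. It assumes the reference files have already been downloaded into a local directory and that the installed files are not compressed; the function name `fw_mismatches` is made up for illustration.

```shell
# List vega10 firmware files whose checksum differs between two directories.
# $1: installed firmware dir, $2: reference copy of the amdgpu firmware files
fw_mismatches() {
    local inst="$1" ref="$2" f name
    for f in "$ref"/vega10*; do
        [ -e "$f" ] || continue          # no vega10 files in the reference
        name=$(basename "$f")
        # Report files that are missing locally or whose md5 sums differ.
        if [ ! -e "$inst/$name" ] ||
           [ "$(md5sum < "$f")" != "$(md5sum < "$inst/$name")" ]; then
            echo "$name"
        fi
    done
}
```

Running `fw_mismatches /lib/firmware/amdgpu <reference-dir>` prints nothing when every vega10 file matches. Some distributions ship the firmware compressed, in which case the checksums will not match directly.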

