Bug 110777

Summary:

Kernel 5.1-5.3 MCLK stuck at 167MHz Vega 10 (56/64)

Product:

DRI

Reporter:

Anton Herzfeld <antonh>

Component:

DRM/AMDgpu

Assignee:

Alex Deucher <alexdeucher>

Status:

RESOLVED MOVED

QA Contact:

Severity:

blocker

Priority:

medium

CC:

alexdeucher, antonh, evan.quan, haro41, rodamorris, samuel, t.clastres, thomas, velemas

Version:

DRI git

Hardware:

x86-64 (AMD64)

OS:

Linux (All)

Attachments:

Description	Flags
Tweaked PowerPlay table	none
dpm varible monitoring script	none
amdgpu-mon.log	none
dpm monitor script	none

Description Anton Herzfeld 2019-05-27 17:56:20 UTC

Hi,

On kernels 5.1 through 5.2, my Vega 56 card's memory clock is stuck at 167MHz and no longer boosts.
The exact same setup boosts fine to 1000MHz memclk when running Kernel 5.0.13.

Is there any info I can provide to get this fixed?

Comment 1 Anton Herzfeld 2019-06-02 17:40:02 UTC

This is still occurring on the latest Linux master, commit cd6c84d8f0cdc911df435bb075ba22ce3c605b07.

Comment 2 Anton Herzfeld 2019-06-03 18:41:33 UTC

The issue is fully fixed on kernel master (currently I am using commit 460b48a0fefce25beb0fc0139e721c5691d65d7f) when drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c is reverted to the state it was in around kernel 5.0.13.

https://git.archlinux.org/linux.git/tree/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c?h=v5.0.13-arch1

I will start bisecting soon to figure out the exact commit that has caused the issue.

Comment 3 Anton Herzfeld 2019-06-03 18:55:18 UTC

Reverting the following two patches fixes the boost in memory clocks, but it seems that once the memory clock has ramped up it does not go down again.

1.
Revert "drm/amd/powerplay: update soc boot and max level on vega10"   
This reverts commit 373e87fc91527124cb8ec21465a6d070a65c56af.

2.
Revert "drm/amd/powerplay: support Vega10 SOCclk and DCEFclk dpm level settings"
This reverts commit bb05821b13fa0c0b97760cb292b30d3105d65954.

Evan Quan <evan.quan@amd.com>
Alex Deucher <alexander.deucher@amd.com>

Comment 4 Anton Herzfeld 2019-06-03 19:21:53 UTC

Is there anything else I can provide to support getting this fixed?

Comment 5 Anton Herzfeld 2019-06-05 17:44:58 UTC

The following patch fixes the issue with boosting again:

https://cgit.freedesktop.org/~agd5f/linux/commit/drivers/gpu/drm/amd/powerplay/hwmgr?h=amd-staging-drm-next&id=7d59c41b5150d0641203f91cfcaa0f9af5999cce

However, it also seems to expose the issue of mclk not going down again once it has boosted.

Just to clarify: the issue has occurred when using manual OD on mclk since kernel 5.1.

Comment 6 Anton Herzfeld 2019-06-13 16:52:07 UTC

@Alex Deucher is there any chance we can get a backport of https://cgit.freedesktop.org/~agd5f/linux/commit/drivers/gpu/drm/amd/powerplay/hwmgr?h=amd-staging-drm-next&id=7d59c41b5150d0641203f91cfcaa0f9af5999cce

into the 5.1 Kernel? This Kernel is broken for Vega 10 otherwise (Kernel 5.2 is also still broken).

Comment 7 haro41 2019-07-11 15:54:02 UTC

I can confirm this issue with kernel 5.2.
HBM2 clocks stay at 167MHz if I try to overclock memory by writing to:
/sys/class/drm/card0/device/pp_od_clk_voltage or
/sys/class/drm/card0/device/pp_table

Below is the output of a monitoring script:

gpu_vdd: 1100 gpu_clk: 1638000000 mem_clk: 167000000 temp: 40000 fan: 1625 pwm: 109 pow: 165000000
gpu_vdd: 1100 gpu_clk: 1623000000 mem_clk: 167000000 temp: 39000 fan: 1608 pwm: 109 pow: 162000000
gpu_vdd: 1100 gpu_clk: 1637000000 mem_clk: 167000000 temp: 41000 fan: 1603 pwm: 109 pow: 161000000
gpu_vdd: 1100 gpu_clk: 1639000000 mem_clk: 167000000 temp: 41000 fan: 1596 pwm: 109 pow: 160000000
gpu_vdd: 1100 gpu_clk: 1640000000 mem_clk: 167000000 temp: 41000 fan: 1618 pwm: 109 pow: 157000000
gpu_vdd: 1100 gpu_clk: 1640000000 mem_clk: 167000000 temp: 41000 fan: 1610 pwm: 109 pow: 159000000
gpu_vdd: 1100 gpu_clk: 1639000000 mem_clk: 167000000 temp: 42000 fan: 1603 pwm: 109 pow: 159000000
gpu_vdd: 1100 gpu_clk: 1638000000 mem_clk: 167000000 temp: 40000 fan: 1601 pwm: 109 pow: 162000000
gpu_vdd: 1100 gpu_clk: 1638000000 mem_clk: 167000000 temp: 42000 fan: 1603 pwm: 109 pow: 162000000
gpu_vdd: 1100 gpu_clk: 1638000000 mem_clk: 167000000 temp: 41000 fan: 1596 pwm: 109 pow: 161000000
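
For readers who want to post-process output like the lines above, here is a minimal Python sketch. The monitoring script itself is only available as an attachment, so the units are an assumption: clocks in Hz, temperature in millidegrees Celsius and power in microwatts, which would match the usual hwmon conventions. The function names are made up for illustration.

```python
import re

# Matches "key: value" pairs such as "gpu_clk: 1638000000".
PAIR = re.compile(r"(\w+):\s*(-?\d+)")

def parse_monitor_line(line):
    """Turn one monitor line into a dict of ints, keyed by field name."""
    return {key: int(value) for key, value in PAIR.findall(line)}

def to_readable(sample):
    """Convert raw values to MHz, degrees C and watts.

    Assumed units (not confirmed by the thread): Hz, millidegrees C, microwatts.
    """
    return {
        "gpu_mhz": sample["gpu_clk"] // 1_000_000,
        "mem_mhz": sample["mem_clk"] // 1_000_000,
        "temp_c": sample["temp"] / 1000,
        "power_w": sample["pow"] / 1_000_000,
    }

line = ("gpu_vdd: 1100 gpu_clk: 1638000000 mem_clk: 167000000 "
        "temp: 40000 fan: 1625 pwm: 109 pow: 165000000")
print(to_readable(parse_monitor_line(line)))
```

Under those assumptions, every sample above shows the core clock near 1.64GHz while the memory clock stays at 167MHz.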

Comment 8 velemas 2019-07-13 21:37:58 UTC

I confirm I have the same issue on my Acer Predator Helios 500 with Vega 56.

Comment 9 velemas 2019-07-22 19:55:49 UTC

Sometimes MCLK gets stuck at 500MHz and SCLK at 879MHz. With these clocks, after some time under load my laptop plays the notification sound as if the power cord had been disconnected, and the power LED also switches off; then the screen looks like a blank TV channel with white noise, and I have to reboot.

Still reproducible on 5.2.2-arch1.

Comment 10 Térence Clastres 2019-08-24 00:54:54 UTC

Same behaviour using pptables: memory either gets stuck at 167MHz (level 0) or at 800MHz (level 2), on 5.2 with https://aur.archlinux.org/packages/amdgpu-dkms/, which from what I understand should pull the latest changes to amdgpu.

Using the classic `echo "m 3 200 1050" | sudo tee /sys/class/drm/card0/device/pp_od_clk_voltage`, I was able to set an absurd memory frequency like 1400MHz, which my various CLI tools report as being in use, but it doesn't look like it does anything.

Comment 11 velemas 2019-08-25 15:44:00 UTC

(In reply to Térence Clastres from comment #10)
> Same behaviour using pptables: memory either get stuck to 167MHz (level 0)
> or 800MHz (level 2), on 5.2 with
> https://aur.archlinux.org/packages/amdgpu-dkms/ which from what I understand
> should pull latest changes to amdgpu.
> 
> If using the classic `echo "m 3 200 1050" | sudo tee 
> /sys/class/drm/card0/device/pp_od_clk_voltage` I found myself able to set an
> absurd memory frequency like 1400MHz which is reported to be used on my
> different cli tools, but it doesn't look like it does anything.

Yes, I also sometimes observe unreal freqs like 2131MHz or something. But I've noticed that when I plug in the power cord of my laptop after the kernel is booted in the bootloader, MCLK is set at the 500MHz level and SCLK is 879MHz, which is enough for all my games. If a game is more demanding, the whole system may fail with the TV static effect, but this can be worked around by writing "manual" to /sys/class/drm/card0/device/power_dpm_force_performance_level (or using corectrl https://aur.archlinux.org/packages/corectrl/) and setting level 5 in /sys/class/drm/card0/device/pp_dpm_socclk, which is 847MHz. With that setting my system is stable.

TV static is also observable on resume when using suspend2ram.
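
The workaround described in this comment could be scripted roughly as follows. This is only a sketch: the two sysfs paths are the ones quoted above, the helper names are hypothetical, and it defaults to a dry run that only prints what would be written. Actually writing requires root, and the level that maps to a stable SOC clock may differ per card.

```python
from pathlib import Path

# Device directory as used in this thread; card0 is an assumption for other systems.
DEVICE = Path("/sys/class/drm/card0/device")

def socclk_workaround_plan(level=5):
    """Return the (file, value) writes described above, in order."""
    return [
        (DEVICE / "power_dpm_force_performance_level", "manual"),
        (DEVICE / "pp_dpm_socclk", str(level)),
    ]

def apply(plan, dry_run=True):
    """Print the plan; only touch sysfs when dry_run is False (needs root)."""
    for path, value in plan:
        print(f"echo {value} > {path}")
        if not dry_run:
            path.write_text(value + "\n")

apply(socclk_workaround_plan())
```

Keeping the plan separate from the write step makes it easy to inspect the commands before running them with `dry_run=False`.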

Comment 12 Térence Clastres 2019-08-25 18:10:23 UTC

(In reply to velemas from comment #11)
> Yes, I also observe sometimes unreal freqs like 2131MHz or something. But
> I've noticed that when I plug the power cord of my laptop after kernel is
> booted in the bootloader then MCLK is set at level 500MHz and SCLK is 879MHz
> which is enough for all my games. But if a game is more demanding then the
> whole system may fail with TV static effect but it can be workarounded by
> sending "manual" to
> /sys/class/drm/card0/device/power_dpm_force_performance_level (or using
> corectrl https://aur.archlinux.org/packages/corectrl/) and setting level 5
> in /sys/class/drm/card0/device/pp_dpm_socclk which is 847Mhz. With that
> setting my system is stable.
> 
> TV static is also observable when using suspend2ram on resume.

I don't know if it changes anything but I'm on a desktop system.

Comment 13 Térence Clastres 2019-09-03 10:09:12 UTC

I can still reproduce this with Linux 5.3-rc7: setting the memclk to anything higher than 950MHz with a powertable leaves it stuck at 800MHz.

Comment 14 haro41 2019-09-03 18:26:24 UTC

(In reply to Térence Clastres from comment #13)
> I can still reproduce with linux 5.3-rc7: Setting the memclk to anything
> higher than 950MHz with a powertable makes it stuck at 800MHz.

I had the same issue (5.3-rc3).

Since I changed the value of 'ucSocClockIndexHigh' in 'state 1' from 5 to 7 (96000 -> 110700), I can run mclk up to 1100MHz without problems.

Comment 15 Térence Clastres 2019-09-03 19:50:22 UTC

(In reply to haro41 from comment #14)
> I had the same issue (5.3-rc3).
> 
> Since i changed the value 'ucSocClockIndexHigh' in 'state 1', from 5->7
> (96000->110700) i can run mclk up to 1100MHz, without problems.

I can't figure out where ucSocClockIndexHigh is in the ppt.

Comment 16 Térence Clastres 2019-09-03 20:25:48 UTC

(In reply to Térence Clastres from comment #15)
> I can't figure where ucSocClockIndexHigh is in the ppt.

Found it and it works! However, after reaching 1095MHz it falls back to 800MHz but doesn't go any lower (to 167MHz or 500MHz).

Comment 17 haro41 2019-09-06 10:58:25 UTC

(In reply to Térence Clastres from comment #16)
> Found it and it works! However after reaching 1095MHz, it falls back to
> 800MHz but doesn't go below (167MHz or 500Mhz).

I remember I adapted my UV values like this:
(cat /sys/class/drm/card0/device/pp_od_clk_voltage)
OD_SCLK:
0:        852Mhz        800mV
1:        991Mhz        850mV
2:       1084Mhz        900mV
3:       1138Mhz        910mV
4:       1200Mhz        920mV
5:       1401Mhz        940mV
6:       1536Mhz        950mV
7:       1630Mhz       1100mV
OD_MCLK:
0:        167Mhz        800mV
1:        500Mhz        800mV
2:        800Mhz        900mV
3:       1100Mhz        940mV
OD_RANGE:
SCLK:     852MHz       2400MHz
MCLK:     167MHz       1500MHz
VDDC:     800mV        1200mV

Maybe this works for you too.
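
To compare tables like the one above between systems, the dump can be parsed with a few lines of Python. This is a sketch that assumes the exact layout shown in this comment (section headers `OD_SCLK:`, `OD_MCLK:`, `OD_RANGE:` and rows of the form `index: <n>Mhz <n>mV`); other kernel versions or ASICs may print a different format. The function name is made up for illustration.

```python
import re

# One table row, e.g. "3:       1100Mhz        940mV".
ENTRY = re.compile(r"^(\d+):\s+(\d+)Mhz\s+(\d+)mV$", re.IGNORECASE)

def parse_od_table(text):
    """Parse OD_SCLK / OD_MCLK rows into {section: {index: (mhz, mv)}}."""
    tables, section = {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if line.startswith("OD_") and line.endswith(":"):
            section = line[:-1]  # remember which section the next rows belong to
            continue
        match = ENTRY.match(line)
        if match and section in ("OD_SCLK", "OD_MCLK"):
            index, mhz, mv = (int(g) for g in match.groups())
            tables.setdefault(section, {})[index] = (mhz, mv)
    return tables

SAMPLE = """OD_SCLK:
0:        852Mhz        800mV
7:       1630Mhz       1100mV
OD_MCLK:
0:        167Mhz        800mV
3:       1100Mhz        940mV
OD_RANGE:
SCLK:     852MHz       2400MHz
"""
print(parse_od_table(SAMPLE))
```

The `OD_RANGE` rows are deliberately skipped here, since they use a different row format.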

Comment 18 Térence Clastres 2019-09-06 11:13:56 UTC

(In reply to haro41 from comment #17)
> I remember i adapted my UV values like this:
> (cat /sys/class/drm/card0/device/pp_od_clk_voltage)
> OD_SCLK:
> 0:        852Mhz        800mV
> 1:        991Mhz        850mV
> 2:       1084Mhz        900mV
> 3:       1138Mhz        910mV
> 4:       1200Mhz        920mV
> 5:       1401Mhz        940mV
> 6:       1536Mhz        950mV
> 7:       1630Mhz       1100mV
> OD_MCLK:
> 0:        167Mhz        800mV
> 1:        500Mhz        800mV
> 2:        800Mhz        900mV
> 3:       1100Mhz        940mV
> OD_RANGE:
> SCLK:     852MHz       2400MHz
> MCLK:     167MHz       1500MHz
> VDDC:     800mV        1200mV
> 
> Maybe this works for you too.

Thanks, I have very similar values. I thought adjusting OD_MCLK voltages would only set the core voltage floor, but I'm not sure what this means in practice.

Comment 19 haro41 2019-09-06 11:40:46 UTC

> Thanks, I share very similar values. I thought adjusting OD_MCLK voltages
> would only set core voltage floor, but I'm not sure what this mean in
> practice.

Yes, the OD_MCLK voltage values are (somewhat misleadingly) actually core voltages, linked by indices in the MCLK table.

Comment 20 Térence Clastres 2019-09-06 11:42:28 UTC

(In reply to haro41 from comment #19)
> Yes, the OD_MCLK voltage values are (somehow missleading) actually core
> voltages linked by indices in MCLK table.

So why change the default values?

Comment 21 haro41 2019-09-06 12:38:39 UTC

(In reply to Térence Clastres from comment #20)
> So why change the default values?

The memory is always clocked depending on the current performance level, 0-7 (GFXCLK). The indices (ucVddInd) in the powerplay table tell the driver/SMU which memory clock has to be used for a specific performance level.

So all you can adjust (with respect to memory clock) is this relation and the memory clocks themselves.

By default, for the air-cooled RX Vega64, MCLK 2 is used beginning with performance level 2 (ucVddInd = 2) and MCLK 3 is used beginning with level 5 (ucVddInd = 5).

The SOCclock must always be above the current MCLK value, and this is where core clock voltage matters! Hence you can't undervolt too much if you are using higher MCLK (and SOCclock) values!
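
The "SOC clock above MCLK" rule can be illustrated with the two SOC values quoted in comment 14 (index 5 = 96000, index 7 = 110700). The sketch below assumes these table values are in units of 10 kHz, i.e. 960MHz and 1107MHz; that unit and the helper names are assumptions, not something stated in the thread.

```python
# SOC clock levels quoted earlier in this thread (index -> raw table value).
# Assumption: raw values are in 10 kHz units, giving 960 MHz and 1107 MHz.
SOC_LEVELS_10KHZ = {5: 96000, 7: 110700}

def soc_mhz(index):
    """SOC clock in MHz for a given level index (under the unit assumption)."""
    return SOC_LEVELS_10KHZ[index] / 100

def mclk_allowed(mclk_mhz, soc_index):
    """True if the SOC clock at soc_index is above the requested MCLK."""
    return soc_mhz(soc_index) > mclk_mhz

for mclk in (950, 1000, 1100):
    print(mclk, mclk_allowed(mclk, 5), mclk_allowed(mclk, 7))
```

Under these assumptions, a 1100MHz memory clock only fits below SOC level 7, which would be consistent with the reports above that values over 950MHz stopped working while ucSocClockIndexHigh was 5.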

Comment 22 Térence Clastres 2019-09-08 15:40:37 UTC

(In reply to haro41 from comment #21)
> The memory is always clocked in dependency to the current performance level,
> 0-7 (GFXCLK). The indices (ucVddInd) in the powerplay table, are telling the
> driver/smu, which memory clock have to be used for a specific performance
> level.
> 
> So, all you can adjust (in respect to memory clock) is this relation and the
> memory clocks itself.
> 
> Per default for the air cooled RX Vega64, MCLK 2 is used beginning with
> performance level 2 (ucVddInd = 2) and MCLK 3 is used beginning with level 5
> (ucVddInd = 5).
> 
> The SOCclock must be always above the current MCLK value, and this is where
> core clock voltage matters! Hence you can't undervolt to much, if you are
> using higher MCLK (and SOCclock) values!

Got it, thank you very much.

Comment 23 Térence Clastres 2019-09-08 15:58:23 UTC

That still doesn't explain why, after level 7 has been reached, MEMCLK doesn't go lower than state 2 when at level 0, with Vddind set to 0.

Comment 24 haro41 2019-09-13 13:08:32 UTC

Indeed, that seems buggy.
Just for verification:
Can you attach your modified powerplay table (as a binary file)?

Comment 25 Térence Clastres 2019-09-13 13:51:54 UTC

Created attachment 145347 [details]
Tweaked PowerPlay table

Sure, here it is.

Comment 26 haro41 2019-09-13 16:18:20 UTC

I can't reproduce this issue with your pp-table loaded. It works as expected here.
I tried with 5.3.0-rc8 and with the head of the 'drm-next' branch from '~agd5f/linux'.

Are your amdgpu firmware files up to date?

Comment 27 Térence Clastres 2019-09-14 14:58:53 UTC

(In reply to haro41 from comment #26)
> I can't reproduce this issue with your pp-table loaded. Works as expected
> here.
> I tried with 5.3.0-rc8 and with head of 'drm-next' branch from
> '~agd5f/linux'.
> 
> Are your amdgpu firmware files up todate?

How do I check that? 
I should add that I found that it does indeed sometimes go back to 167MHz, but not all the time.

Comment 28 haro41 2019-09-15 19:44:00 UTC

(In reply to Térence Clastres from comment #27)
> How do I check that? 
> I should add that I found that it does indeed sometimes go back to 167MHz,
> but not all the time.

Firmware files for amdgpu are usually in '/lib/firmware/amdgpu', and the 'vega10*' prefixed files are the ones for Vega 56/64.

Here you can download amdgpu firmware files:
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu

Comment 29 Térence Clastres 2019-09-15 19:56:56 UTC

(In reply to haro41 from comment #28)
> Firmware files for amdgpu are usually in '/lib/firmware/amdgpu' and
> 'vega10*' prefixed files are related to vega 56/64.
> 
> Here you can download amdgpu firmware files:
> https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/
> tree/amdgpu

Thanks, comparing md5sum shows they are up-to-date.

Comment 30 velemas 2019-09-20 13:36:06 UTC

Still the same with 5.3 on Arch.

Comment 31 Adam 2019-10-17 16:42:09 UTC

I have this issue with a Vega 56 (Red Dragon from PowerColor). I wonder if it occurs on Vega 64 too.

What I have found is that in applications like Blender or even Superposition Benchmark (https://benchmark.unigine.com/superposition) memory clocks boost up correctly. Only when launching a game (from Steam, like Dota 2) or a standalone one (Pillars of Eternity, GOG installer) do I see some spikes to 700MHz (but only at the beginning, maybe just a software read error), and then memory does not go higher than 167MHz.

I see some answers, but I am not sure: did anyone find a working workaround?

Comment 32 velemas 2019-10-17 20:15:51 UTC

(In reply to Adam from comment #31)
> I have this issue with vega 56 (red dragon from powercolor). I wonder if it
> occurs on vega 64 too.
> 
> What I have found, is that in applications like blender or even
> Superposition Benchmark (https://benchmark.unigine.com/superposition) memory
> clocks boost up correctly, only when trying to launch any game (from steam,
> like dota 2) or just standalone one (Pillars from eternity, GOG installer),
> I see some spikes at 700MHz (but only at the beginning, maybe just software
> read error) and then memory is not going higher than 167Mhz.
> 
> I see some answers but I am not sure, did anyone found working workaround?

For me the workaround is amdgpu.dpm=0 in the kernel parameters. It switches off DPM, but performance is OK (I don't know what freqs it runs at in that case), as is temperature (not much noise from the fans).

Comment 33 haro41 2019-10-18 09:35:43 UTC

(In reply to Adam from comment #31)
> I have this issue with vega 56 (red dragon from powercolor). I wonder if it
> occurs on vega 64 too.
> 
> What I have found, is that in applications like blender or even
> Superposition Benchmark (https://benchmark.unigine.com/superposition) memory
> clocks boost up correctly, only when trying to launch any game (from steam,
> like dota 2) or just standalone one (Pillars from eternity, GOG installer),
> I see some spikes at 700MHz (but only at the beginning, maybe just software
> read error) and then memory is not going higher than 167Mhz.
> 
> I see some answers but I am not sure, did anyone found working workaround?

I see this behavior under 3D load while VSYNC is enabled. I think in this case the load is not large enough to trigger a permanent switch to a higher dpm level.

The problem on my system is that with very low load the ULV mode (0.800-0.050V=0.750) seems to stay active, and my GPU sometimes crashes.

Comment 34 haro41 2019-10-18 09:38:49 UTC

... I forgot to mention: I am on kernel 5.3.1 now, and this happens with and without any pp_table modification ...

Comment 35 haro41 2019-10-18 15:36:13 UTC

While debugging Vega 10 powerplay, I found some code that looks fishy and could be related:

from 'vega10_hwmgr.c':

#define HBM_MEMORY_CHANNEL_WIDTH    128
static const uint32_t channel_number[] = {1, 2, 0, 4, 0, 8, 0, 16, 2};
...

static int vega10_populate_all_memory_levels(struct pp_hwmgr *hwmgr)
{
	struct vega10_hwmgr *data = hwmgr->backend;
	PPTable_t *pp_table = &(data->smc_state_table.pp_table);
...
	pp_table->NumMemoryChannels = (uint16_t)(data->mem_channels);
	pp_table->MemoryChannelWidth =
	        (uint16_t)(HBM_MEMORY_CHANNEL_WIDTH *
	                   channel_number[data->mem_channels]);
...
}

Debugging gives:
NumMemoryChannels: 7
MemoryChannelWidth: 2048

Since 'data->mem_channels' is obviously meant as an index into 'channel_number[]', I think it is very unlikely that this code does what it was meant to do.


Maybe it was meant more like this?

static int vega10_populate_all_memory_levels(struct pp_hwmgr *hwmgr)
{
	struct vega10_hwmgr *data = hwmgr->backend;
	PPTable_t *pp_table = &(data->smc_state_table.pp_table);
...

	pp_table->NumMemoryChannels = channel_number[data->mem_channels];
	pp_table->MemoryChannelWidth = HBM_MEMORY_CHANNEL_WIDTH;
...
}

Debugging gives:
NumMemoryChannels: 16
MemoryChannelWidth: 128


If this is a real bug, it could be related to this thread, because the smc firmware will always calculate a higher available memory bandwidth than there really is, and hence switches the memory clock too low/too late.
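
The two variants above can be replayed outside the kernel to check the debug values. The Python sketch below copies the lookup table and the constant from the quoted C code and assumes 'data->mem_channels' is 7, as the debug output implies. The last printed column, channels times width, is only an illustration of how the two fields might be combined; how the SMC firmware actually uses them is not known here.

```python
# Constants copied from the quoted vega10_hwmgr.c snippet.
HBM_MEMORY_CHANNEL_WIDTH = 128
CHANNEL_NUMBER = [1, 2, 0, 4, 0, 8, 0, 16, 2]

def current_code(mem_channels):
    """Mirror the existing assignments: raw index, and width * looked-up count."""
    return mem_channels, HBM_MEMORY_CHANNEL_WIDTH * CHANNEL_NUMBER[mem_channels]

def proposed_code(mem_channels):
    """Mirror the proposed assignments: looked-up count, and per-channel width."""
    return CHANNEL_NUMBER[mem_channels], HBM_MEMORY_CHANNEL_WIDTH

mem_channels = 7  # value implied by the debug output above
for name, fn in (("current", current_code), ("proposed", proposed_code)):
    channels, width = fn(mem_channels)
    # Hypothetical combination of the two fields, for comparison only.
    print(name, channels, width, channels * width)
```

If the firmware did multiply the two fields, the current code would describe a bus seven times wider than the 2048 bits the proposed code describes.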

Comment 36 velemas 2019-10-18 17:01:39 UTC

(In reply to haro41 from comment #35)
> While debugging vega 10 powerplay, i found some code that looks fishy and
> could be related:
> 
> from 'vega10_hwmgr.c':
> 
> #define HBM_MEMORY_CHANNEL_WIDTH    128
> static const uint32_t channel_number[] = {1, 2, 0, 4, 0, 8, 0, 16, 2};
> ...
> 
> static int vega10_populate_all_memory_levels(struct pp_hwmgr *hwmgr)
> {
> 	struct vega10_hwmgr *data = hwmgr->backend;
> 	PPTable_t *pp_table = &(data->smc_state_table.pp_table);
> ...
> 	pp_table->NumMemoryChannels = (uint16_t)(data->mem_channels);
> 	pp_table->MemoryChannelWidth =
> 	        (uint16_t)(HBM_MEMORY_CHANNEL_WIDTH *
> 	                   channel_number[data->mem_channels]);
> ...
> }
> 
> Debugging gives:
> NumMemoryChannels: 7
> MemoryChannelWidth: 2048
> 
> Since 'data->mem_channels' is obviously meant as an index in
> 'channel_number[]', i think it is very unlikely, this code does what it was
> meant to do.
> 
> 
> Maybe it was more meant like this?:
> 
> static int vega10_populate_all_memory_levels(struct pp_hwmgr *hwmgr)
> {
> 	struct vega10_hwmgr *data = hwmgr->backend;
> 	PPTable_t *pp_table = &(data->smc_state_table.pp_table);
> ...
> 
> 	pp_table->NumMemoryChannels = channel_number[data->mem_channels];
> 	pp_table->MemoryChannelWidth = HBM_MEMORY_CHANNEL_WIDTH;
> ...
> }
> 
> Debugging gives:
> NumMemoryChannels: 16
> MemoryChannelWidth: 128
> 
> 
> If this is a real bug, it could be related to this thread, because the smc
> firmware will always calculate an higher available memory bandwidth and
> hence switches memory clock to low/late.

Did you try to test this? AFAIR Vega has exactly a 2048-bit channel width. OTOH, why do they calculate it at runtime? Or is the bandwidth also reduced in a lower power state (part of the bus switched off)?

Comment 37 haro41 2019-10-18 17:17:17 UTC

Yes, the suggested code works, but I haven't tested it much yet. I would say it switches earlier to higher memory clock levels.

Vega 10 has 16 channels of 128 bits each, which makes 2048 bits in total.
The available bandwidth depends on the total width and the current memory clock.

So the smc can calculate the available bandwidth in real time and adapt it to the current requirements by switching the memory clock level up and down.
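
As a worked example of that relation, the sketch below computes the theoretical peak bandwidth for the memory clock levels mentioned in this thread. It assumes the usual HBM2 convention of two transfers per clock (double data rate); the function name is made up, and this is plain arithmetic, not the formula used inside the SMC firmware.

```python
def hbm2_bandwidth_gbs(channels, width_bits, mclk_mhz):
    """Theoretical peak bandwidth in GB/s, assuming two transfers per clock."""
    bytes_per_transfer = channels * width_bits / 8
    return bytes_per_transfer * mclk_mhz * 2 / 1000

# 16 channels x 128 bits = 2048 bits, at the MCLK levels seen in this thread.
for mclk in (167, 500, 800, 1100):
    print(mclk, round(hbm2_bandwidth_gbs(16, 128, mclk), 1))
```

At 167MHz the bus delivers roughly a fifth of what it does at 800MHz, which is why a memory clock stuck at the lowest level is so visible in games.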

Comment 38 velemas 2019-10-18 17:49:07 UTC

(In reply to haro41 from comment #37)
> Yes, the suggested code works, but i didn't test it much yet. I would say it
> switches earlier to higher memory clock levels.
> 
> Vega 10 has 16 channels, 128bit each, makes 2048bit total.
> The available bandwith depends of total width and the current memory clock.
> 
> So the smc can calculate the available bandwidth in realtime and adapt it to
> the current requirements by switching memory clock level up and down.

Thank you, I will compile it myself and test.

Comment 39 haro41 2019-10-18 20:37:24 UTC

I tested the proposed modification a bit more and can confirm that these values affect the memory clock level switching.

For more significant test results, you can simply set 'pp_table->NumMemoryChannels' to 1, like this:

pp_table->NumMemoryChannels = 1;

Comment 40 velemas 2019-10-18 23:37:34 UTC

(In reply to haro41 from comment #39)
> I tested the proposed modification a bit more and find confirmed, that this
> values are affecting the memory clock level switching.
> 
> For more significant test results, you can simple set
> 'pp_table->NumMemoryChannels to 1, like this:
> 
> pp_table->NumMemoryChannels = 1;

In my case it just got stuck at 500MHz. pp_dpm_socclk has to be adjusted manually too; otherwise, under some load, I just see TV static again.

Comment 41 haro41 2019-10-21 16:07:12 UTC

... after some more debugging, I found this code:


static void vega10_notify_smc_display_change(struct pp_hwmgr *hwmgr,
                bool has_disp)
{
	smum_send_msg_to_smc_with_parameter(hwmgr,
	                                    PPSMC_MSG_SetUclkFastSwitch,
	                                    has_disp ? 1 : 0);
}

This function is called very early during driver initialization and, as the names suggest, it sends a message to the smc to enable fast switching between the memory clock levels.

This fast switching of memory clock levels seems to cause some problems, especially if the GPU load is limited (VSYNC on).
As mentioned, I sometimes get crashes (with VSYNC enabled), and while a game is running the device spends some time in ULV (ULV: mclk=167MHz vdd:0.750).

I finally tried the following fix:

static void vega10_notify_smc_display_change(struct pp_hwmgr *hwmgr,
                bool has_disp)
{
  smum_send_msg_to_smc_with_parameter(hwmgr, PPSMC_MSG_SetUclkFastSwitch, 0);
  smum_send_msg_to_smc_with_parameter(hwmgr, PPSMC_MSG_SetUclkDownHyst, 100);

}

The first smc command disables fast mclk switching; the second sets a hysteresis time of 100(ms?) for the mclk level down event.

This fixed the crashing, and the mclk levels are significantly more persistent (no ULV state while a game is running).

I think there is either a bug in the smc firmware with respect to fast mclk level switching, or the ULV mode and fast mclk level switching are not meant to be used simultaneously.
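
To show what a "down hysteresis" means in general, here is a toy Python model: the clock level steps up immediately, but only steps down after demand has stayed lower for more than a given number of samples. This is purely an illustration of the concept. It is not derived from the SMC firmware, and how the firmware interprets the PPSMC_MSG_SetUclkDownHyst parameter is not known here.

```python
def simulate(demand, down_hyst):
    """Return the chosen level per sample.

    Steps up at once; steps down only after demand has stayed below the
    current level for more than down_hyst consecutive samples.
    """
    level, below, chosen = 0, 0, []
    for wanted in demand:
        if wanted >= level:
            level, below = wanted, 0
        else:
            below += 1
            if below > down_hyst:
                level, below = wanted, 0
        chosen.append(level)
    return chosen

# A bursty load, such as a VSYNC-limited game might produce.
demand = [3, 0, 3, 0, 0, 0, 0]
print(simulate(demand, 0))  # no hysteresis: follows every dip
print(simulate(demand, 2))  # with hysteresis: rides out short dips
```

With hysteresis the level stays high across the short dip between the two bursts, which matches the "more persistent" mclk levels described above.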

Comment 42 haro41 2019-10-21 16:27:10 UTC

... this bug could be related too:

https://bugs.freedesktop.org/show_bug.cgi?id=109955

Comment 43 velemas 2019-10-21 18:47:22 UTC

(In reply to haro41 from comment #41)
> ... after some more debugging, i found this code:
> 
> 
> static void vega10_notify_smc_display_change(struct pp_hwmgr *hwmgr,
>                 bool has_disp)
> {
> 	smum_send_msg_to_smc_with_parameter(hwmgr,
> 	                                    PPSMC_MSG_SetUclkFastSwitch,
> 	                                    has_disp ? 1 : 0);
> }
> 
> This function is called very early at driver initialization and as the names
> are suggesting, this code sends a message to smc to enable fast switching
> between the memory clock levels. 
> 
> This fast switching of memory clock levels seems to cause some problems,
> especially if the GPU load is limited (VSYNC on).
> As mentioned, i get crashes (with VSYNC enabled) sometimes and while a game
> is running the device spends some time in ULV (ULV: mclk=167Mhz vdd:0.750).
> 
> I finally tried the following fix:
> 
> static void vega10_notify_smc_display_change(struct pp_hwmgr *hwmgr,
>                 bool has_disp)
> {
>   smum_send_msg_to_smc_with_parameter(hwmgr, PPSMC_MSG_SetUclkFastSwitch, 0);
>   smum_send_msg_to_smc_with_parameter(hwmgr, PPSMC_MSG_SetUclkDownHyst, 100);
> 
> }
> 
> The first smc command disables fast mclk switching, the second sets a
> hysterese time of 100(ms?) for the mclk level down event.
> 
> This fixed the crashing and the mclk levels are significantly more
> persistant (no ULV state while a game is running).
> 
> I think there is ether a bug in the smc firmware in respect to fast mclk
> level switching, or the ULV mode and the fast mclk level switching are not
> meant to be used simultaneously.

It changed nothing for me, but I have a laptop system. Maybe something is different there.

On my system the problem affects CPU freqs too. The CPU also gets stuck at the minimum of 548MHz regardless of the cpufreq governor (maybe because of amdgpu.bapm, but setting it to 0 does not change anything). Only dpm=0 helps.

I don't think it is a fw problem, since on Win 10 I don't have the issue (I guess the fw blob is the same on both systems). The only problem I have on both Linux and Win is garbage on the screen after suspend-to-ram.

Comment 44 haro41 2019-10-21 19:41:39 UTC

I am on a desktop system, and the only issue is this affinity to 167MHz mclk and ULV while 3D load is present (VSYNC enabled), with some crashes because of it.

I think on your laptop you are seeing a superposition of this issue and probably some additional issues.

Can you try the following:
compile your kernel with CONFIG_DYNAMIC_DEBUG=y
add 'amdgpu.dyndbg=+pf log_buf_len=1M' to your kernel parameters.

Print kernel log via 'dmesg |grep DCEFCLK!'.

Can you see a message like this:

... amdgpu: [powerplay] Cannot find requested DCEFCLK!

Comment 45 velemas 2019-10-21 22:00:14 UTC

(In reply to haro41 from comment #44)
> I am on a desktop system and the only issue is this affinity to 167Mhz mclk
> and ULV, while 3D load is present (VSYNC enabled). Some crashes because of
> this.
> 
> I think on your laptop, you will see a superposition of this issue and
> probably some additional issues.
> 
> Can you try the following:
> compile your kernel with CONFIG_DYNAMIC_DEBUG=y
> add 'amdgpu.dyndbg=+pf log_buf_len=1M' to your kernel parameters.
> 
> Print kernel log via 'dmesg |grep DCEFCLK!'.
> 
> Can you see a message like this:
> 
> ... amdgpu: [powerplay] Cannot find requested DCEFCLK!

Yes exactly:
[    2.866153] vega10_notify_smc_display_config_after_ps_adjustment: amdgpu: [powerplay] Cannot find requested DCEFCLK!
[    2.867413] vega10_notify_smc_display_config_after_ps_adjustment: amdgpu: [powerplay] Cannot find requested DCEFCLK!
[   25.677850] vega10_notify_smc_display_config_after_ps_adjustment: amdgpu: [powerplay] Cannot find requested DCEFCLK!
[   26.305300] vega10_notify_smc_display_config_after_ps_adjustment: amdgpu: [powerplay] Cannot find requested DCEFCLK!
[   26.640150] vega10_notify_smc_display_config_after_ps_adjustment: amdgpu: [powerplay] Cannot find requested DCEFCLK!
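
If it helps anyone comparing logs, here is a minimal sketch (not part of the driver or of any attached script) that pulls these messages out of saved dmesg output and returns the timestamp and the reporting function. The name find_dcefclk_warnings is made up for this example, and the pattern assumes the function-name prefix that 'amdgpu.dyndbg=+pf' produces in the lines above.

```python
import re

# Assumption: lines look like the dmesg output quoted in this comment,
# i.e. "[ <seconds>] <function>: amdgpu: [powerplay] Cannot find requested DCEFCLK!"
PATTERN = re.compile(
    r"^\[\s*(\d+\.\d+)\]\s+(\w+): amdgpu: \[powerplay\] "
    r"Cannot find requested DCEFCLK!"
)


def find_dcefclk_warnings(log_text):
    """Return a list of (timestamp_seconds, function_name) for matching lines."""
    hits = []
    for line in log_text.splitlines():
        match = PATTERN.match(line.strip())
        if match:
            hits.append((float(match.group(1)), match.group(2)))
    return hits


# Two lines copied from the log above; the last line is a made-up filler
# that must not match.
sample = """\
[    2.866153] vega10_notify_smc_display_config_after_ps_adjustment: amdgpu: [powerplay] Cannot find requested DCEFCLK!
[   25.677850] vega10_notify_smc_display_config_after_ps_adjustment: amdgpu: [powerplay] Cannot find requested DCEFCLK!
[   30.000000] filler line that is not a DCEFCLK warning
"""

for timestamp, function in find_dcefclk_warnings(sample):
    print(f"{timestamp:10.6f}  {function}")
```

On a real system the input could be the text saved from 'dmesg > dmesg.txt'; the sketch only reads text and does not touch the kernel log itself.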

Comment 46 haro41 2019-10-22 07:55:15 UTC

Thank you, so it is not only on my system.
While this message is not necessarily related, perhaps it is worth investigating a bit more.

Comment 47 haro41 2019-10-26 18:54:36 UTC

I investigated the behavior of the SetUclkFastSwitch and SetUclkDownHyst commands (see comment #41) a bit more and have to add: the hysteresis parameter is the only one that makes a difference, and it seems to act as a logical on/off switch only.

So the following additional line in function 'vega10_notify_smc_display_change()' fixes the crashes for me:

static void vega10_notify_smc_display_change(struct pp_hwmgr *hwmgr,
		bool has_disp)
{
	smum_send_msg_to_smc_with_parameter(hwmgr,
			PPSMC_MSG_SetUclkFastSwitch,
			has_disp ? 1 : 0);

	/* enable hysteresis for mclk switching, to work around crashes */
	smum_send_msg_to_smc_with_parameter(hwmgr, PPSMC_MSG_SetUclkDownHyst, 100);
}

(This is with one display connected only)

Comment 48 haro41 2019-10-26 18:59:43 UTC

... kernel 5.3.1 here and the parameter for PPSMC_MSG_SetUclkDownHyst can be anything above 0 ...

Comment 49 velemas 2019-10-26 23:46:03 UTC

I can confirm that with this change the crash is gone (kernel 5.3.7), and I can even set higher frequencies in manual mode. But automatic frequency scaling still does not work (mclk is stuck at 167MHz).

Comment 50 haro41 2019-10-27 07:04:22 UTC

(In reply to velemas from comment #49)
> I can confirm that with this change crash is gone (kernel 5.3.7) and i even
> can set higher freqs in manual mode. But automatic freq scaling still does
> not work (freq is stuck at 167MHz).

mclk level switching happens too fast to observe directly. While VSYNC is activated, you will need a monitoring script to see it.

Maybe you can try the one I attached; it has to be run as root.

Comment 51 haro41 2019-10-27 07:06:41 UTC

Created attachment 145828 [details]
dpm varible  monitoring script

Run this with root privileges and adapt the sampling interval as needed.
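
For readers who cannot open the attachment, here is a rough sketch of the core idea. It is not the attached script. It assumes the usual amdgpu sysfs files pp_dpm_sclk and pp_dpm_mclk, where each line looks like "1: 500Mhz" and the active level is marked with '*'. The function names active_level and read_levels are made up for this example.

```python
import re
from pathlib import Path

# Assumption: one level per line, e.g. "0: 167Mhz *", active level marked with '*'.
LEVEL_RE = re.compile(r"^(\d+):\s*(\d+)\s*MHz\s*(\*)?\s*$", re.IGNORECASE)


def active_level(pp_dpm_text):
    """Return (index, mhz) of the level marked with '*', or None if none is marked."""
    for line in pp_dpm_text.splitlines():
        match = LEVEL_RE.match(line.strip())
        if match and match.group(3):
            return int(match.group(1)), int(match.group(2))
    return None


def read_levels(device_dir):
    """Read the active sclk and mclk levels from one amdgpu device directory."""
    device = Path(device_dir)
    return {
        name: active_level((device / f"pp_dpm_{name}").read_text())
        for name in ("sclk", "mclk")
    }


# Illustrative file content only: 167 and 500 appear in this thread,
# the two upper levels are placeholder values.
example = "0: 167Mhz *\n1: 500Mhz \n2: 800Mhz \n3: 945Mhz \n"
print(active_level(example))
```

A monitor would call read_levels("/sys/class/drm/card0/device") in a loop with a short time.sleep() between samples; the card0 path is the one used elsewhere in this thread and may differ on other systems.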

Comment 52 velemas 2019-10-27 09:31:13 UTC

Created attachment 145829 [details]
amdgpu-mon.log

I changed hwmon0 to hwmon1 in your script because, for some reason, my system has only that directory (I have only 1 discrete GPU, no APU).
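
A small sketch of how a script could avoid the hard-coded hwmon number by looking up whatever hwmonN directory exists below the device. The helper names find_hwmon_dirs and read_millivolts are made up here, and reading in0_input as the GPU voltage in millivolts is my assumption about what the "vdd" column corresponds to.

```python
from pathlib import Path


def find_hwmon_dirs(device_dir):
    """Return the hwmonN directories below an amdgpu device directory, sorted."""
    base = Path(device_dir) / "hwmon"
    if not base.is_dir():
        return []
    return sorted(p for p in base.glob("hwmon*") if p.is_dir())


def read_millivolts(hwmon_dir):
    """Read in0_input (assumed here to be the GPU voltage in mV) as an integer."""
    return int((Path(hwmon_dir) / "in0_input").read_text().strip())
```

With this, something like find_hwmon_dirs("/sys/class/drm/card0/device")[0] would pick hwmon1 on my system and hwmon0 on yours without editing the script.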

I tried to play Pine in automatic mode with your script running. The performance was really bad.

wc -l amdgpu-mon.log
3065 amdgpu-mon.log

`grep -v "mclk: 167" amdgpu-mon.log` gives only a few entries:
sclk: -1945 mclk: 500 vdd: 837
sclk: -1948 mclk: 500 vdd: 837
sclk: -1903 mclk: 500 vdd: 843
sclk: -1949 mclk: 500 vdd: 812
sclk: -1943 mclk: 500 vdd: 837
sclk: -1944 mclk: 500 vdd: 837
sclk: -1952 mclk: 500 vdd: 837
sclk: -1953 mclk: 500 vdd: 812
sclk: -1952 mclk: 500 vdd: 837
sclk: -1953 mclk: 500 vdd: 812
sclk: -1953 mclk: 500 vdd: 837
sclk: -1965 mclk: 500 vdd: 812
sclk: -1963 mclk: 500 vdd: 837
sclk: -1962 mclk: 500 vdd: 837
sclk: -1966 mclk: 500 vdd: 812
sclk: -1966 mclk: 500 vdd: 837
sclk: -1972 mclk: 500 vdd: 812
sclk: -1966 mclk: 500 vdd: 812
sclk: -1966 mclk: 500 vdd: 837

And sclk shows implausible (negative) values.
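
The same check can be done in one pass over the log. This is a sketch that assumes every sample line has the "sclk: N mclk: N vdd: N" shape shown above; the function name summarize is made up for this example.

```python
import re
from collections import Counter

# Assumption: each sample line looks like "sclk: -1945 mclk: 500 vdd: 837".
LINE_RE = re.compile(r"sclk:\s*(-?\d+)\s+mclk:\s*(-?\d+)\s+vdd:\s*(-?\d+)")


def summarize(log_text):
    """Return (Counter of mclk values, number of samples with negative sclk)."""
    mclk_counts = Counter()
    negative_sclk = 0
    for line in log_text.splitlines():
        match = LINE_RE.search(line)
        if not match:
            continue  # skip anything that is not a sample line
        sclk, mclk, _vdd = (int(value) for value in match.groups())
        mclk_counts[mclk] += 1
        if sclk < 0:
            negative_sclk += 1
    return mclk_counts, negative_sclk


# Three lines copied from the grep output above.
sample = """\
sclk: -1945 mclk: 500 vdd: 837
sclk: -1948 mclk: 500 vdd: 837
sclk: -1903 mclk: 500 vdd: 843
"""

counts, negatives = summarize(sample)
print(dict(counts), negatives)
```

Run over the whole amdgpu-mon.log, the counter shows how the samples split between 167 and 500 mclk, and the second number shows how many samples carry a negative sclk.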

Comment 53 haro41 2019-10-28 22:08:11 UTC

Do you see better performance in prime with mclk level forced to high?

echo "manual" > /sys/class/drm/card0/device/power_dpm_force_performance_level
echo "3" > /sys/class/drm/card0/device/pp_dpm_mclk
 
Do you see such negative values for sclk via direct read from sysfs too?
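
If you prefer to script the two echo commands above, here is a minimal sketch. It writes the same two sysfs files, needs root on a real system, and takes the device directory as a parameter. The function names force_mclk_level and restore_auto are made up, and writing "auto" to go back to automatic scaling is my assumption of the usual way to undo the manual setting.

```python
from pathlib import Path


def force_mclk_level(device_dir, level):
    """Switch to manual DPM control, then select one mclk level by index."""
    device = Path(device_dir)
    # Same as: echo "manual" > .../power_dpm_force_performance_level
    (device / "power_dpm_force_performance_level").write_text("manual\n")
    # Same as: echo "<level>" > .../pp_dpm_mclk
    (device / "pp_dpm_mclk").write_text(f"{level}\n")


def restore_auto(device_dir):
    """Hand clock selection back to the driver (assumed value: "auto")."""
    (Path(device_dir) / "power_dpm_force_performance_level").write_text("auto\n")
```

force_mclk_level("/sys/class/drm/card0/device", 3) corresponds to the two commands above; restore_auto() with the same path afterwards returns to automatic scaling.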

Comment 54 haro41 2019-10-28 22:14:37 UTC

Created attachment 145838 [details]
dpm monitor script

Comment 55 velemas 2019-10-29 00:19:13 UTC

(In reply to haro41 from comment #53)
> Do you see better performance in prime with mclk level forced to high?
> 
> echo "manual" > /sys/class/drm/card0/device/power_dpm_force_performance_level
> echo "3" > /sys/class/drm/card0/device/pp_dpm_mclk
>  
> Do you see such negative values for sclk via direct read from sysfs too?

Please see my comment 11. Nothing has changed since then.

I can set higher sclk and mclk levels with good performance, but only when the power cord is plugged in after the kernel has booted (the power LED is then on; otherwise the LED switches off once the kernel has booted). Switching levels back and forth still does not work.

I am investigating ac[dc]_power in vega10_hwmgr.c.

Comment 56 Martin Peres 2019-11-19 09:29:26 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/801.