Summary: | Kernel 5.1-5.3 MCLK stuck at 167MHz Vega 10 (56/64)
---|---
Product: | DRI
Component: | DRM/AMDgpu
Reporter: | Anton Herzfeld <antonh>
Assignee: | Alex Deucher <alexdeucher>
Status: | RESOLVED MOVED
Severity: | blocker
Priority: | medium
CC: | alexdeucher, antonh, evan.quan, haro41, rodamorris, samuel, t.clastres, thomas, velemas
Version: | DRI git
Hardware: | x86-64 (AMD64)
OS: | Linux (All)

Description
Anton Herzfeld
2019-05-27 17:56:20 UTC
This is still occurring on the latest Linux master, cd6c84d8f0cdc911df435bb075ba22ce3c605b07.

The issue is fully fixed on kernel master (I am currently using commit 460b48a0fefce25beb0fc0139e721c5691d65d7f) when drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c is reverted to the state it was in around kernel 5.0.13:
https://git.archlinux.org/linux.git/tree/drivers/gpu/drm/amd/powerplay/hwmgr/vega10_hwmgr.c?h=v5.0.13-arch1
I will start bisecting soon to find the exact commit that caused the issue.

Reverting the following two patches fixes the memory clock boost, but it seems that once the memory clock has ramped up it does not go down again.
1. Revert "drm/amd/powerplay: update soc boot and max level on vega10"
   This reverts commit 373e87fc91527124cb8ec21465a6d070a65c56af.
2. Revert "drm/amd/powerplay: support Vega10 SOCclk and DCEFclk dpm level settings"
   This reverts commit bb05821b13fa0c0b97760cb292b30d3105d65954.

Evan Quan <evan.quan@amd.com>, Alex Deucher <alexander.deucher@amd.com>: is there anything else I can provide to support getting this fixed?

The following patch fixes boosting again:
https://cgit.freedesktop.org/~agd5f/linux/commit/drivers/gpu/drm/amd/powerplay/hwmgr?h=amd-staging-drm-next&id=7d59c41b5150d0641203f91cfcaa0f9af5999cce
However, it also seems to expose the issue of mclk not going down again once it has boosted.

Just to clarify: the issue occurs when using manual OD on mclk, since kernel 5.1.

@Alex Deucher, is there any chance we can get a backport of https://cgit.freedesktop.org/~agd5f/linux/commit/drivers/gpu/drm/amd/powerplay/hwmgr?h=amd-staging-drm-next&id=7d59c41b5150d0641203f91cfcaa0f9af5999cce into the 5.1 kernel? That kernel is otherwise broken for Vega 10 (kernel 5.2 is also still broken).

I can confirm this issue with kernel 5.2.
HBM2 clocks stay at 167 MHz if I try to overclock the memory by writing to /sys/class/drm/card0/device/pp_od_clk_voltage or /sys/class/drm/card0/device/pp_table.

Below is the output of a monitoring script:
gpu_vdd: 1100 gpu_clk: 1638000000 mem_clk: 167000000 temp: 40000 fan: 1625 pwm: 109 pow: 165000000
gpu_vdd: 1100 gpu_clk: 1623000000 mem_clk: 167000000 temp: 39000 fan: 1608 pwm: 109 pow: 162000000
gpu_vdd: 1100 gpu_clk: 1637000000 mem_clk: 167000000 temp: 41000 fan: 1603 pwm: 109 pow: 161000000
gpu_vdd: 1100 gpu_clk: 1639000000 mem_clk: 167000000 temp: 41000 fan: 1596 pwm: 109 pow: 160000000
gpu_vdd: 1100 gpu_clk: 1640000000 mem_clk: 167000000 temp: 41000 fan: 1618 pwm: 109 pow: 157000000
gpu_vdd: 1100 gpu_clk: 1640000000 mem_clk: 167000000 temp: 41000 fan: 1610 pwm: 109 pow: 159000000
gpu_vdd: 1100 gpu_clk: 1639000000 mem_clk: 167000000 temp: 42000 fan: 1603 pwm: 109 pow: 159000000
gpu_vdd: 1100 gpu_clk: 1638000000 mem_clk: 167000000 temp: 40000 fan: 1601 pwm: 109 pow: 162000000
gpu_vdd: 1100 gpu_clk: 1638000000 mem_clk: 167000000 temp: 42000 fan: 1603 pwm: 109 pow: 162000000
gpu_vdd: 1100 gpu_clk: 1638000000 mem_clk: 167000000 temp: 41000 fan: 1596 pwm: 109 pow: 161000000

I confirm I have the same issue on my Acer Predator Helios 500 with Vega 56. Sometimes MCLK gets stuck at 500 MHz and SCLK at 879 MHz. With these clocks, after some time under load my laptop plays the sound notification it makes when the power cord is disconnected and the power LED switches off; then the screen looks like a blank TV screen with white noise, and I have to reboot.

Still reproducible on 5.2.2-arch1.

Same behaviour using pptables: memory gets stuck at either 167 MHz (level 0) or 800 MHz (level 2), on 5.2 with https://aur.archlinux.org/packages/amdgpu-dkms/ which, from what I understand, should pull the latest changes to amdgpu.
If using the classic `echo "m 3 200 1050" | sudo tee /sys/class/drm/card0/device/pp_od_clk_voltage`, I found myself able to set an absurd memory frequency like 1400 MHz, which my various CLI tools report as being in use, but it doesn't look like it does anything.

(In reply to Térence Clastres from comment #10)
Yes, I also sometimes observe unreal frequencies like 2131 MHz. But I've noticed that when I plug in my laptop's power cord after the kernel has been booted from the bootloader, MCLK is set to the 500 MHz level and SCLK to 879 MHz, which is enough for all my games. If a game is more demanding, the whole system may fail with the TV static effect, but this can be worked around by writing "manual" to /sys/class/drm/card0/device/power_dpm_force_performance_level (or using corectrl https://aur.archlinux.org/packages/corectrl/) and setting level 5 in /sys/class/drm/card0/device/pp_dpm_socclk, which is 847 MHz. With that setting my system is stable.

TV static is also observable when resuming from suspend-to-RAM.

(In reply to velemas from comment #11)
I don't know if it changes anything, but I'm on a desktop system.

I can still reproduce with Linux 5.3-rc7: setting the memclk to anything higher than 950 MHz with a powertable makes it stick at 800 MHz.

(In reply to Térence Clastres from comment #13)
I had the same issue (5.3-rc3).

Since I changed the value 'ucSocClockIndexHigh' in 'state 1' from 5 to 7 (96000 -> 110700), I can run mclk up to 1100 MHz without problems.

(In reply to haro41 from comment #14)
I can't figure out where ucSocClockIndexHigh is in the ppt.

(In reply to Térence Clastres from comment #15)
Found it and it works! However, after reaching 1095 MHz it falls back to 800 MHz but doesn't go any lower (167 MHz or 500 MHz).

(In reply to Térence Clastres from comment #16)
I remember I adapted my UV values like this (cat /sys/class/drm/card0/device/pp_od_clk_voltage):
OD_SCLK:
0: 852Mhz 800mV
1: 991Mhz 850mV
2: 1084Mhz 900mV
3: 1138Mhz 910mV
4: 1200Mhz 920mV
5: 1401Mhz 940mV
6: 1536Mhz 950mV
7: 1630Mhz 1100mV
OD_MCLK:
0: 167Mhz 800mV
1: 500Mhz 800mV
2: 800Mhz 900mV
3: 1100Mhz 940mV
OD_RANGE:
SCLK: 852MHz 2400MHz
MCLK: 167MHz 1500MHz
VDDC: 800mV 1200mV

Maybe this works for you too.
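
For anyone who wants to reapply a table like the one pasted above, here is a minimal Python sketch that parses that text and builds the strings to write back to pp_od_clk_voltage. It assumes the `s <level> <MHz> <mV>` / `m <level> <MHz> <mV>` syntax followed by `c` to commit, the same syntax as the `echo "m 3 ..."` command quoted earlier in this thread. `parse_od_table` and `od_commands` are names made up for this sketch, and nothing here touches sysfs.

```python
import re

# Table exactly as pasted in the comment above.
OD_TEXT = """\
OD_SCLK:
0: 852Mhz 800mV
1: 991Mhz 850mV
2: 1084Mhz 900mV
3: 1138Mhz 910mV
4: 1200Mhz 920mV
5: 1401Mhz 940mV
6: 1536Mhz 950mV
7: 1630Mhz 1100mV
OD_MCLK:
0: 167Mhz 800mV
1: 500Mhz 800mV
2: 800Mhz 900mV
3: 1100Mhz 940mV
OD_RANGE:
SCLK: 852MHz 2400MHz
MCLK: 167MHz 1500MHz
VDDC: 800mV 1200mV
"""

SECTION_RE = re.compile(r"^(OD_\w+):$")
LEVEL_RE = re.compile(r"^(\d+):\s+(\d+)M[Hh]z\s+(\d+)mV$")


def parse_od_table(text):
    """Return {"OD_SCLK": [(level, mhz, mv), ...], "OD_MCLK": [...]}."""
    tables = {}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        header = SECTION_RE.match(line)
        if header:
            section = header.group(1)
            continue
        level = LEVEL_RE.match(line)
        # OD_RANGE rows do not match LEVEL_RE and are skipped on purpose.
        if level and section in ("OD_SCLK", "OD_MCLK"):
            tables.setdefault(section, []).append(
                tuple(int(part) for part in level.groups()))
    return tables


def od_commands(tables):
    """Build one string per write to pp_od_clk_voltage, ending with commit."""
    commands = []
    for prefix, section in (("s", "OD_SCLK"), ("m", "OD_MCLK")):
        for level, mhz, mv in tables.get(section, []):
            commands.append(f"{prefix} {level} {mhz} {mv}")
    commands.append("c")  # commit the edited levels
    return commands


for command in od_commands(parse_od_table(OD_TEXT)):
    print(command)
```

Each printed string would be written to the file separately (as root), for example with `echo "<command>" | sudo tee /sys/class/drm/card0/device/pp_od_clk_voltage`.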
(In reply to haro41 from comment #17)
Thanks, I share very similar values. I thought adjusting the OD_MCLK voltages would only set the core voltage floor, but I'm not sure what this means in practice.

Yes, the OD_MCLK voltage values are (somewhat misleadingly) actually core voltages, linked by indices in the MCLK table.
(In reply to haro41 from comment #19)
So why change the default values?

(In reply to Térence Clastres from comment #20)
The memory is always clocked depending on the current performance level, 0-7 (GFXCLK). The indices (ucVddInd) in the powerplay table tell the driver/SMU which memory clock has to be used for a specific performance level.

So all you can adjust (with respect to the memory clock) is this relation and the memory clocks themselves.

By default, on the air-cooled RX Vega 64, MCLK 2 is used starting with performance level 2 (ucVddInd = 2) and MCLK 3 is used starting with level 5 (ucVddInd = 5).

The SOC clock must always be above the current MCLK value, and this is where core clock voltage matters! Hence you can't undervolt too much if you are using higher MCLK (and SOC clock) values.

(In reply to haro41 from comment #21)
Got it, thank you very much. I still don't understand why, after level 7 has been reached, MEMCLK doesn't go lower than state 2 when at level 0, with VddInd set to 0.

Indeed, that seems buggy. Just for verification: can you attach your modified powerplay table (as a binary file)?

Created attachment 145347 [details]
Tweaked PowerPlay table
Sure, here it is.
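
As a side note on the rule from comment #21 (the SOC clock must stay above the current MCLK): combined with the 'ucSocClockIndexHigh' values from comment #14, it appears consistent with the 950 MHz ceiling reported in comment #13. The Python sketch below only replays that arithmetic. It assumes the table values 96000 and 110700 are in 10 kHz units (960 MHz and 1107 MHz), which is an assumption and not something stated in the thread, and `mclk_allowed` is a hypothetical helper, not driver logic.

```python
# SOC clock table values quoted in comment #14, keyed by ucSocClockIndexHigh.
# ASSUMPTION: the unit is 10 kHz, so 96000 -> 960 MHz and 110700 -> 1107 MHz.
SOCCLK_10KHZ = {5: 96000, 7: 110700}


def to_mhz(clock_10khz):
    """Convert a value in 10 kHz units to MHz."""
    return clock_10khz / 100


def mclk_allowed(mclk_mhz, soc_index):
    """Rule from comment #21: the SOC clock must be above the current MCLK."""
    return to_mhz(SOCCLK_10KHZ[soc_index]) > mclk_mhz


for soc_index in (5, 7):
    for mclk in (800, 950, 1000, 1100):
        print(f"soc index {soc_index}: mclk {mclk} MHz ok = "
              f"{mclk_allowed(mclk, soc_index)}")
```

Under that assumption, index 5 permits 950 MHz but not 1000 MHz or 1100 MHz, while index 7 permits 1100 MHz, which matches what was reported above.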
I can't reproduce this issue with your pp-table loaded. It works as expected here.
I tried with 5.3.0-rc8 and with the head of the 'drm-next' branch from '~agd5f/linux'.

Are your amdgpu firmware files up to date?

(In reply to haro41 from comment #26)
How do I check that?
I should add that I found it does indeed sometimes go back to 167 MHz, but not all the time.

(In reply to Térence Clastres from comment #27)
Firmware files for amdgpu are usually in '/lib/firmware/amdgpu', and the 'vega10*'-prefixed files are the ones for Vega 56/64.

Here you can download amdgpu firmware files:
https://git.kernel.org/pub/scm/linux/kernel/git/firmware/linux-firmware.git/tree/amdgpu

(In reply to haro41 from comment #28)
Thanks, comparing md5sums shows they are up to date.

Still the same with 5.3 on Arch.

I have this issue with a Vega 56 (Red Dragon from PowerColor). I wonder if it occurs on Vega 64 too.

What I have found is that in applications like Blender or even the Superposition benchmark (https://benchmark.unigine.com/superposition) memory clocks boost up correctly. Only when launching a game (from Steam, like Dota 2, or a standalone one such as Pillars of Eternity from the GOG installer) do I see some spikes at 700 MHz (but only at the beginning; maybe just a software read error), and then memory does not go higher than 167 MHz.

I see some answers but I am not sure: did anyone find a working workaround?

(In reply to Adam from comment #31)
For me the workaround is amdgpu.dpm=0 in the kernel parameters. It switches off DPM, but performance is OK (I don't know what frequencies it runs at in that case), as is temperature (not much noise from the fans).

(In reply to Adam from comment #31)
I see this behavior under 3D load while VSYNC is enabled. I think in this case the load is not large enough to trigger a permanent switch to a higher DPM level.
The problem on my system is that with very low load the ULV mode (0.800 - 0.050 V = 0.750 V) seems to stay active, and my GPU sometimes crashes.

... I forgot to mention: I am on kernel 5.3.1 meanwhile, and this happens with and without any pp_table modification ...

While debugging Vega 10 powerplay, I found some code that looks fishy and could be related.

From 'vega10_hwmgr.c':

    #define HBM_MEMORY_CHANNEL_WIDTH 128
    static const uint32_t channel_number[] = {1, 2, 0, 4, 0, 8, 0, 16, 2};
    ...

    static int vega10_populate_all_memory_levels(struct pp_hwmgr *hwmgr)
    {
        struct vega10_hwmgr *data = hwmgr->backend;
        PPTable_t *pp_table = &(data->smc_state_table.pp_table);
        ...
        pp_table->NumMemoryChannels = (uint16_t)(data->mem_channels);
        pp_table->MemoryChannelWidth =
                (uint16_t)(HBM_MEMORY_CHANNEL_WIDTH *
                        channel_number[data->mem_channels]);
        ...
    }

Debugging gives:
NumMemoryChannels: 7
MemoryChannelWidth: 2048

Since 'data->mem_channels' is obviously meant as an index into 'channel_number[]', I think it is very unlikely that this code does what it was meant to do.

Maybe it was meant more like this?

    static int vega10_populate_all_memory_levels(struct pp_hwmgr *hwmgr)
    {
        struct vega10_hwmgr *data = hwmgr->backend;
        PPTable_t *pp_table = &(data->smc_state_table.pp_table);
        ...
        pp_table->NumMemoryChannels = channel_number[data->mem_channels];
        pp_table->MemoryChannelWidth = HBM_MEMORY_CHANNEL_WIDTH;
        ...
    }

Debugging gives:
NumMemoryChannels: 16
MemoryChannelWidth: 128

If this is a real bug, it could be related to this thread, because the SMC firmware will always calculate a higher available memory bandwidth and hence switch the memory clock too low/late.

(In reply to haro41 from comment #35)
Did you try to test this? AFAIR Vega has exactly a 2048-bit channel width.
OTOH, why do they calculate them at runtime? Or is the bandwidth also reduced in a lower power state (part of the bus switched off)?

Yes, the suggested code works, but I haven't tested it much yet. I would say it switches earlier to higher memory clock levels.

Vega 10 has 16 channels of 128 bits each, which makes 2048 bits in total.
The available bandwidth depends on the total width and the current memory clock.

So the SMC can calculate the available bandwidth in real time and adapt it to the current requirements by switching the memory clock level up and down.

(In reply to haro41 from comment #37)
Thank you, I will compile it myself and test.

I tested the proposed modification a bit more and can confirm that these values affect the memory clock level switching.

For more significant test results, you can simply set 'pp_table->NumMemoryChannels' to 1, like this:

    pp_table->NumMemoryChannels = 1;

(In reply to haro41 from comment #39)
In my case it just gets stuck at 500 MHz. pp_dpm_socclk has to be adjusted manually too; otherwise, under some load, I just see TV static again.

... after some more debugging, I found this code:

    static void vega10_notify_smc_display_change(struct pp_hwmgr *hwmgr,
            bool has_disp)
    {
        smum_send_msg_to_smc_with_parameter(hwmgr,
                PPSMC_MSG_SetUclkFastSwitch,
                has_disp ? 1 : 0);
    }

This function is called very early during driver initialization and, as the names suggest, it sends a message to the SMC to enable fast switching between the memory clock levels.

This fast switching of memory clock levels seems to cause some problems, especially if the GPU load is limited (VSYNC on).
As mentioned, I sometimes get crashes (with VSYNC enabled), and while a game is running the device spends some time in ULV (ULV: mclk = 167 MHz, vdd = 0.750 V).

I finally tried the following fix:

    static void vega10_notify_smc_display_change(struct pp_hwmgr *hwmgr,
            bool has_disp)
    {
        smum_send_msg_to_smc_with_parameter(hwmgr, PPSMC_MSG_SetUclkFastSwitch, 0);
        smum_send_msg_to_smc_with_parameter(hwmgr, PPSMC_MSG_SetUclkDownHyst, 100);
    }

The first SMC command disables fast mclk switching; the second sets a hysteresis time of 100 (ms?) for the mclk level-down event.

This fixed the crashing, and the mclk levels are significantly more persistent (no ULV state while a game is running).

I think there is either a bug in the SMC firmware with respect to fast mclk level switching, or the ULV mode and fast mclk level switching are not meant to be used simultaneously.

... this bug could be related too: https://bugs.freedesktop.org/show_bug.cgi?id=109955

(In reply to haro41 from comment #41)
It changed nothing for me, but I have a laptop system. Maybe something is different there. On my system the problem affects CPU frequencies too: the CPU also gets stuck at the minimum, 548 MHz, regardless of the performance frequency governor (maybe because of amdgpu.bapm, but setting it to 0 does not change anything). Only dpm=0 helps. I don't think it is a firmware problem, since on Windows 10 I don't have the issue (I guess the firmware blob is the same on both systems). The only problem I have on both Linux and Windows is garbage on the screen after suspend-to-RAM.

I am on a desktop system, and the only issue is this affinity to 167 MHz mclk and ULV while 3D load is present (VSYNC enabled). Some crashes because of this.

I think on your laptop you will see a superposition of this issue and probably some additional issues.

Can you try the following:
compile your kernel with CONFIG_DYNAMIC_DEBUG=y
add 'amdgpu.dyndbg=+pf log_buf_len=1M' to your kernel parameters.

Print the kernel log via 'dmesg |grep DCEFCLK!'.

Can you see a message like this:

... amdgpu: [powerplay] Cannot find requested DCEFCLK!
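
Going back to the channel arithmetic from comments #35 and #37: the two variants of `vega10_populate_all_memory_levels()` can be replayed in a few lines of Python. The table and the width constant are copied from the quoted C code, and `mem_channels = 7` is taken from the debug output. That the SMC firmware multiplies the two fields to get a total bus width is haro41's hypothesis and is treated here purely as an assumption; the function names are made up for this sketch.

```python
# Constants copied from the vega10_hwmgr.c excerpt quoted in comment #35.
HBM_MEMORY_CHANNEL_WIDTH = 128
CHANNEL_NUMBER = [1, 2, 0, 4, 0, 8, 0, 16, 2]


def current_fields(mem_channels):
    """(NumMemoryChannels, MemoryChannelWidth) as the existing code fills them."""
    return mem_channels, HBM_MEMORY_CHANNEL_WIDTH * CHANNEL_NUMBER[mem_channels]


def proposed_fields(mem_channels):
    """(NumMemoryChannels, MemoryChannelWidth) as proposed in comment #35."""
    return CHANNEL_NUMBER[mem_channels], HBM_MEMORY_CHANNEL_WIDTH


def implied_total_width(num_channels, channel_width):
    """ASSUMPTION: the firmware derives the bus width as channels * width."""
    return num_channels * channel_width


mem_channels = 7  # index value implied by the debug output in comment #35
for label, fields in (("current", current_fields(mem_channels)),
                      ("proposed", proposed_fields(mem_channels))):
    print(label, fields, implied_total_width(*fields))
```

If the assumption holds, the existing code would imply a 14336-bit bus (7 x 2048), seven times the 2048 bits (16 x 128) that comment #37 gives for Vega 10, which would fit the suspicion that the firmware overestimates the available bandwidth.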
(In reply to haro41 from comment #44)
Yes exactly:
[ 2.866153] vega10_notify_smc_display_config_after_ps_adjustment: amdgpu: [powerplay] Cannot find requested DCEFCLK!
[ 2.867413] vega10_notify_smc_display_config_after_ps_adjustment: amdgpu: [powerplay] Cannot find requested DCEFCLK!
[ 25.677850] vega10_notify_smc_display_config_after_ps_adjustment: amdgpu: [powerplay] Cannot find requested DCEFCLK!
[ 26.305300] vega10_notify_smc_display_config_after_ps_adjustment: amdgpu: [powerplay] Cannot find requested DCEFCLK!
[ 26.640150] vega10_notify_smc_display_config_after_ps_adjustment: amdgpu: [powerplay] Cannot find requested DCEFCLK!

Thank you, so it is not only on my system. While this message is not necessarily related, it is perhaps worth investigating a bit more.

I investigated the behavior of the SetUclkFastSwitch and SetUclkDownHyst commands (see comment #41) a bit more and have to add: the hysteresis parameter is the only one that makes the difference, and it seems to be a logic switch only.

So the following additional line in the function 'vega10_notify_smc_display_change()' fixes the crashes for me:

    static void vega10_notify_smc_display_change(struct pp_hwmgr *hwmgr,
            bool has_disp)
    {
        smum_send_msg_to_smc_with_parameter(hwmgr,
                PPSMC_MSG_SetUclkFastSwitch,
                has_disp ? 1 : 0);
        /* enable hysteresis for mclk switching, to work around crashes */
        smum_send_msg_to_smc_with_parameter(hwmgr,
                PPSMC_MSG_SetUclkDownHyst,
                100);
    }

(This is with only one display connected.)

... kernel 5.3.1 here, and the parameter for PPSMC_MSG_SetUclkDownHyst can be anything above 0 ...

I can confirm that with this change the crash is gone (kernel 5.3.7) and I can even set higher frequencies in manual mode. But automatic frequency scaling still does not work (the frequency is stuck at 167 MHz).

(In reply to velemas from comment #49)
mclk level switching happens too fast to have a chance of observing it; while VSYNC is activated you will need a monitoring script. Maybe try the one I attached; it has to be run as root.

Created attachment 145828 [details]
dpm variable monitoring script
Run this with root privileges and adapt the sampling interval as needed.
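
The attached script itself is not reproduced in this thread. As a rough stand-in, a minimal Python monitor could look like the sketch below. It assumes the usual `pp_dpm_sclk` / `pp_dpm_mclk` layout in which the active level is marked with `*`, and it assumes the GPU voltage is readable in mV from an `in0_input` file under the card's hwmon directory (the hwmon number differs between systems). `DEVICE`, `active_mhz`, `format_sample` and `monitor` are names invented for this sketch, and the real attachment evidently read its values differently.

```python
import glob
import time

# Assumed sysfs location of the GPU; adjust the card index as needed.
DEVICE = "/sys/class/drm/card0/device"


def active_mhz(pp_dpm_text):
    """Return the MHz value of the level marked with '*', or None."""
    for line in pp_dpm_text.splitlines():
        fields = line.split()
        # Expected shape of an active line: ["1:", "500Mhz", "*"]
        if len(fields) >= 3 and fields[-1] == "*":
            value = fields[1]
            if value.lower().endswith("mhz"):
                return int(value[:-3])
    return None


def format_sample(sclk, mclk, vdd):
    """One log line in the style used elsewhere in this thread."""
    return f"sclk: {sclk} mclk: {mclk} vdd: {vdd}"


def read_text(path):
    with open(path) as handle:
        return handle.read()


def read_vdd_mv():
    """Read the first hwmon in0_input found (assumed to be GPU voltage, mV)."""
    paths = sorted(glob.glob(DEVICE + "/hwmon/hwmon*/in0_input"))
    return int(read_text(paths[0]).strip()) if paths else None


def monitor(interval_s=0.05, samples=None):
    """Print one sample per interval; runs until interrupted if samples is None."""
    taken = 0
    while samples is None or taken < samples:
        sclk = active_mhz(read_text(DEVICE + "/pp_dpm_sclk"))
        mclk = active_mhz(read_text(DEVICE + "/pp_dpm_mclk"))
        print(format_sample(sclk, mclk, read_vdd_mv()), flush=True)
        taken += 1
        time.sleep(interval_s)
```

Calling `monitor()` as root and redirecting the output to a file gives a log that can be filtered with grep afterwards; a shorter `interval_s` catches brief level switches at the cost of a larger log.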
Created attachment 145829 [details]
amdgpu-mon.log
I changed hwmon0 to hwmon1 in your script because for some reason my system has only that directory (i have only 1 discrete GPU, no APU).
I tried to play Pine in automatic mode with your script running. The performance was really bad.
wc -l amdgpu-mon.log
3065 amdgpu-mon.log
`grep -v "mclk: 167" amdgpu-mon.log` gives only few entries:
sclk: -1945 mclk: 500 vdd: 837
sclk: -1948 mclk: 500 vdd: 837
sclk: -1903 mclk: 500 vdd: 843
sclk: -1949 mclk: 500 vdd: 812
sclk: -1943 mclk: 500 vdd: 837
sclk: -1944 mclk: 500 vdd: 837
sclk: -1952 mclk: 500 vdd: 837
sclk: -1953 mclk: 500 vdd: 812
sclk: -1952 mclk: 500 vdd: 837
sclk: -1953 mclk: 500 vdd: 812
sclk: -1953 mclk: 500 vdd: 837
sclk: -1965 mclk: 500 vdd: 812
sclk: -1963 mclk: 500 vdd: 837
sclk: -1962 mclk: 500 vdd: 837
sclk: -1966 mclk: 500 vdd: 812
sclk: -1966 mclk: 500 vdd: 837
sclk: -1972 mclk: 500 vdd: 812
sclk: -1966 mclk: 500 vdd: 812
sclk: -1966 mclk: 500 vdd: 837
And sclk has unreal values.
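
To put numbers on a log like this, the grep above can be replaced by a small tally. The Python sketch below assumes every sample line has the `sclk: ... mclk: ... vdd: ...` shape shown above; `summarize` is a name made up here, and the two sample lines are copied from the output above.

```python
import re
from collections import Counter

LINE_RE = re.compile(r"sclk:\s*(-?\d+)\s+mclk:\s*(\d+)\s+vdd:\s*(\d+)")


def summarize(lines):
    """Count samples per mclk value and how many samples have a negative sclk."""
    mclk_counts = Counter()
    negative_sclk = 0
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip anything that is not a sample line
        sclk, mclk = int(match.group(1)), int(match.group(2))
        mclk_counts[mclk] += 1
        if sclk < 0:
            negative_sclk += 1
    return mclk_counts, negative_sclk


# Two lines copied from the grep output above.
SAMPLE = [
    "sclk: -1945 mclk: 500 vdd: 837",
    "sclk: -1948 mclk: 500 vdd: 837",
]
counts, negative = summarize(SAMPLE)
print(dict(counts), negative)
```

Running `summarize(open("amdgpu-mon.log"))` on the full log would give the exact split. If all 3065 lines are samples, the 19 entries listed above correspond to roughly 0.6% of the samples being at 500 MHz.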
Do you see better performance in prime with the mclk level forced to high?

echo "manual" > /sys/class/drm/card0/device/power_dpm_force_performance_level
echo "3" > /sys/class/drm/card0/device/pp_dpm_mclk

Do you see such negative values for sclk via a direct read from sysfs too?

Created attachment 145838 [details]
dpm monitor script
(In reply to haro41 from comment #53)
Please see my comment 11. Nothing has changed since then. I can set higher sclk and mclk levels with good performance, but only when the power cord is plugged in after the kernel has booted (the power LED is on; otherwise the LED switches off once the kernel has booted). But switching levels back and forth does not work. I am investigating ac[dc]_power in vega10_hwmgr.c.

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/801.