Bug 111987

Summary: Unstable performance (periodic and repeating patterns of fps change) and changing VDDGFX
Product: DRI Reporter: Witold Baryluk <witold.baryluk+freedesktop>
Component: DRM/AMDgpu Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED QA Contact:
Severity: normal    
Priority: medium CC: witold.baryluk+freedesktop
Version: XOrg git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
Frametimes during execution of The Talos Principle (64-bit) while looking at the ground/wall none
Frametimes during run of Overwatch (Wine+DXVK) and OBS in background (not recording or even previewing!). none

Description Witold Baryluk 2019-10-13 05:03:14 UTC
AMD Radeon Fury X.

Linux debian 5.2.0-3-amd64 #1 SMP Debian 5.2.17-1 (2019-09-26) x86_64 GNU/Linux

ii  xserver-xorg-video-radeon                                   1:19.0.1-1                            amd64        X.Org X server -- AMD/ATI Radeon display driver
ii  xserver-xorg-video-amdgpu                                   19.0.1-1                              amd64        X.Org X server -- AMDGPU display driver
ii  libdrm-radeon1:amd64                                        2.4.99-1                              amd64        Userspace interface to radeon-specific kernel DRM services -- runtime
ii  libdrm-amdgpu1:amd64                                        2.4.99-1                              amd64        Userspace interface to amdgpu-specific kernel DRM services -- runtime


I was able to reproduce the issue in a few titles:

Overwatch (64-bit Windows game) with various Wine and DXVK versions, as well as when using the Wine OpenGL renderer.
Talos (native 64-bit Linux game) with the Vulkan renderer.

Tested with both Mesa 19.2.1-1 with LLVM 9 from Debian, and a custom-compiled Mesa 19.3.0-devel with the LLVM 10 and ACO compiler backends.

If I set the game up to render the same thing constantly (I simply go to a corner of the map and look at the ground or a corner, where there is a minimal amount of geometry and variability), I initially get a very high and stable frame rate of, let's say, 105 FPS (plus or minus 1 FPS). However, if I wait long enough, there are periodic (not sporadic, but actually periodic and exactly repeatable) intervals where the FPS drops. During such an interval the GPU load increases from 30% to 100%, sometimes with one or two intermediate steps (depending on the game and setup).

I also notice that the GPU VDD changes during these intervals.

I eliminated all other sources of variability. Nothing is running in the background.

The reported GPU temperature stays below 32 deg C and is flat during testing.

Sometimes, if I keep the game running long enough, it stabilizes and stops doing that. But sometimes, if I wait long enough, it falls back into this behaviour. Most of the time the behaviour is extremely repetitive and predictable, not random.

Please see the attached frametime graphs (captured with a modified Mesa Vulkan overlay) for Talos and Overwatch.
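
For anyone who wants to put a number on "periodic and exactly repeatable", here is a minimal sketch that estimates the repeat period of a frametime log with a plain autocorrelation. It is not the tool used for the attached graphs. The function name `estimate_period` and the input format (a list of frametimes in milliseconds) are assumptions made for illustration.

```python
from statistics import mean


def estimate_period(samples, min_lag=2):
    """Return the lag (in samples) with the highest autocorrelation, or None.

    `samples` is a hypothetical list of frametimes, e.g. in milliseconds.
    """
    n = len(samples)
    mu = mean(samples)
    centered = [s - mu for s in samples]
    denom = sum(c * c for c in centered)
    if denom == 0:
        return None  # perfectly flat series: no pattern to find

    best_lag, best_score = None, 0.0
    for lag in range(min_lag, n // 2 + 1):
        # Dividing by the lag-0 energy (not by the overlap length) makes the
        # score shrink for longer lags, so the fundamental period is
        # preferred over its multiples.
        score = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / denom
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag
```

The returned lag is in frames; multiplying it by the mean frametime gives an approximate period in time units. A stable run with no repeating pattern yields `None` or a low-scoring lag.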
Comment 1 Witold Baryluk 2019-10-13 05:03:47 UTC
Created attachment 145723 [details]
Frametimes during execution of The Talos Principle (64-bit) while looking at the ground/wall
Comment 2 Witold Baryluk 2019-10-13 05:04:44 UTC
Created attachment 145724 [details]
Frametimes during run of Overwatch (Wine+DXVK) and OBS in background (not recording or even previewing!).
Comment 3 Witold Baryluk 2019-10-13 05:08:30 UTC
I initially blamed OBS (Open Broadcasting Studio) for the problem. But I was able to reproduce the issue even with OBS recording, previewing, grabbing frame, or even it running.

So I am almost sure it is a hardware or power management issue in the kernel driver.

Unfortunately, I am not quite able to capture timelines of /sys/kernel/debug/dri/0/amdgpu_pm_info correlated with the frametimes, as reading this debugfs file will very often block all rendering for a few milliseconds, skewing the results (frametime spikes).

However, I can tell that the temperature reported there is constantly below 32 deg C, and the frequency looks the same all the time, at least for the GPU core.

The voltage and GPU load reported there are all over the place, sometimes reporting 0% GPU load despite the game running in windowed mode and producing 200 FPS, which should be at least ~30% GPU load according to my other measurements.

I also confirmed the FPS / frametime issues with 3 other independent methods (DXVK_HUD=fps, GALLIUM_HUD=fps, in-game FPS / frametime counters), but the main one is the modified Mesa overlay.
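
One possible alternative to polling amdgpu_pm_info is to sample the individual hwmon and device files (the ones dumped in comment 8) with a monotonic timestamp, so the values can later be lined up against a frametime log. This is a rough sketch only. The `hwmon0` index, the field names and the helper names are assumptions, and it has not been checked whether these reads stall rendering any less than the debugfs file does.

```python
import time
from pathlib import Path

# Paths relative to the device directory (e.g. /sys/class/drm/card0/device).
# The hwmon0 index is an assumption; it can differ between systems.
# Divisors follow the usual hwmon units: Hz, mV, millidegrees C, microwatts.
FIELDS = {
    "sclk_mhz": ("hwmon/hwmon0/freq1_input", 1_000_000),
    "vddgfx_v": ("hwmon/hwmon0/in0_input", 1_000),
    "temp_c": ("hwmon/hwmon0/temp1_input", 1_000),
    "power_w": ("hwmon/hwmon0/power1_average", 1_000_000),
    "busy_pct": ("gpu_busy_percent", 1),
}


def read_sample(device_dir):
    """Read one timestamped sample; fields that cannot be read are None."""
    device = Path(device_dir)
    sample = {"t": time.monotonic()}
    for name, (rel_path, divisor) in FIELDS.items():
        try:
            sample[name] = int((device / rel_path).read_text().strip()) / divisor
        except (OSError, ValueError):
            sample[name] = None
    return sample


def log_samples(device_dir, count, interval_s=0.1):
    """Collect `count` samples, sleeping `interval_s` seconds between reads."""
    samples = []
    for _ in range(count):
        samples.append(read_sample(device_dir))
        time.sleep(interval_s)
    return samples
```

For example, `log_samples("/sys/class/drm/card0/device", 600)` would cover roughly one minute at the default interval.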
Comment 4 Witold Baryluk 2019-10-13 05:08:54 UTC
> I initially blamed OBS (Open Broadcasting Studio) for the problem. But I was able to reproduce the issue even with OBS recording, previewing, grabbing frame, or even it running.

s/with/WITHOUT/.
Comment 5 Witold Baryluk 2019-10-13 05:13:11 UTC
# cat /sys/kernel/debug/dri/0/amdgpu_firmware_info 
VCE feature version: 0, firmware version: 0x37020300
UVD feature version: 0, firmware version: 0x015b0c00
MC feature version: 0, firmware version: 0x00000000
ME feature version: 49, firmware version: 0x000000a7
PFP feature version: 49, firmware version: 0x000000fd
CE feature version: 49, firmware version: 0x0000008c
RLC feature version: 1, firmware version: 0x000000d6
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
MEC feature version: 49, firmware version: 0x000002d9
MEC2 feature version: 49, firmware version: 0x000002d9
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x00000000
TA XGMI feature version: 0, firmware version: 0x00000000
TA RAS feature version: 0, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x00161000
SDMA0 feature version: 31, firmware version: 0x00000022
SDMA1 feature version: 0, firmware version: 0x00000022
VCN feature version: 0, firmware version: 0x00000000
DMCU feature version: 0, firmware version: 0x00000000
VBIOS version: 113-C8800100-102
#

ii  firmware-amd-graphics                                       20190717-2                            all          Binary firmware for AMD/ATI graphics chips


Some pieces from dmesg:

Oct 10 16:33:44 localhost kernel: [    1.421938] [drm] amdgpu kernel modesetting enabled.
Oct 10 16:33:44 localhost kernel: [    1.421996] Parsing CRAT table with 2 nodes
Oct 10 16:33:44 localhost kernel: [    1.422003] Ignoring ACPI CRAT on non-APU system
Oct 10 16:33:44 localhost kernel: [    1.422005] Virtual CRAT table created for CPU
Oct 10 16:33:44 localhost kernel: [    1.422006] Parsing CRAT table with 2 nodes
Oct 10 16:33:44 localhost kernel: [    1.422007] Creating topology SYSFS entries
Oct 10 16:33:44 localhost kernel: [    1.422020] Topology: Add CPU node
Oct 10 16:33:44 localhost kernel: [    1.422020] Finished initializing topology
Oct 10 16:33:44 localhost kernel: [    1.422178] amdgpu 0000:43:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x80000000 -> 0x8fffffff
Oct 10 16:33:44 localhost kernel: [    1.422180] amdgpu 0000:43:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x90000000 -> 0x901fffff
Oct 10 16:33:44 localhost kernel: [    1.422181] amdgpu 0000:43:00.0: remove_conflicting_pci_framebuffers: bar 5: 0x9f800000 -> 0x9f83ffff
Oct 10 16:33:44 localhost kernel: [    1.422183] checking generic (80000000 1f0000) vs hw (80000000 10000000)
Oct 10 16:33:44 localhost kernel: [    1.422184] fb0: switching to amdgpudrmfb from EFI VGA
Oct 10 16:33:44 localhost kernel: [    1.422209] Console: switching to colour dummy device 80x25
Oct 10 16:33:44 localhost kernel: [    1.422284] amdgpu 0000:43:00.0: vgaarb: deactivate vga console
Oct 10 16:33:44 localhost kernel: [    1.422549] [drm] initializing kernel modesetting (FIJI 0x1002:0x7300 0x1002:0x0B36 0xC8).
Oct 10 16:33:44 localhost kernel: [    1.422563] [drm] register mmio base: 0x9F800000
Oct 10 16:33:44 localhost kernel: [    1.422563] [drm] register mmio size: 262144
Oct 10 16:33:44 localhost kernel: [    1.422573] [drm] add ip block number 0 <vi_common>
Oct 10 16:33:44 localhost kernel: [    1.422573] [drm] add ip block number 1 <gmc_v8_0>
Oct 10 16:33:44 localhost kernel: [    1.422574] [drm] add ip block number 2 <tonga_ih>
Oct 10 16:33:44 localhost kernel: [    1.422575] [drm] add ip block number 3 <gfx_v8_0>
Oct 10 16:33:44 localhost kernel: [    1.422576] [drm] add ip block number 4 <sdma_v3_0>
Oct 10 16:33:44 localhost kernel: [    1.422577] [drm] add ip block number 5 <powerplay>
Oct 10 16:33:44 localhost kernel: [    1.422578] [drm] add ip block number 6 <dm>
Oct 10 16:33:44 localhost kernel: [    1.422579] [drm] add ip block number 7 <uvd_v6_0>
Oct 10 16:33:44 localhost kernel: [    1.422579] [drm] add ip block number 8 <vce_v3_0>
Oct 10 16:33:44 localhost kernel: [    1.422594] [drm] UVD is enabled in physical mode
Oct 10 16:33:44 localhost kernel: [    1.422595] [drm] VCE enabled in physical mode
Oct 10 16:33:44 localhost kernel: [    1.424046] ATOM BIOS: 113-C8800100-102
Oct 10 16:33:44 localhost kernel: [    1.424070] [drm] RAS INFO: ras initialized successfully, hardware ability[0] ras_mask[0]
Oct 10 16:33:44 localhost kernel: [    1.424073] [drm] vm size is 512 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
Oct 10 16:33:44 localhost kernel: [    1.424080] amdgpu 0000:43:00.0: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
Oct 10 16:33:44 localhost kernel: [    1.424081] amdgpu 0000:43:00.0: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
Oct 10 16:33:44 localhost kernel: [    1.424085] [drm] Detected VRAM RAM=4096M, BAR=256M
Oct 10 16:33:44 localhost kernel: [    1.424086] [drm] RAM width 512bits HBM
Oct 10 16:33:44 localhost kernel: [    1.424126] [TTM] Zone  kernel: Available graphics memory: 65980746 KiB
Oct 10 16:33:44 localhost kernel: [    1.424126] [TTM] Zone   dma32: Available graphics memory: 2097152 KiB
Oct 10 16:33:44 localhost kernel: [    1.424127] [TTM] Initializing pool allocator
Oct 10 16:33:44 localhost kernel: [    1.424129] [TTM] Initializing DMA pool allocator
Oct 10 16:33:44 localhost kernel: [    1.424158] [drm] amdgpu: 4096M of VRAM memory ready
Oct 10 16:33:44 localhost kernel: [    1.424160] [drm] amdgpu: 4096M of GTT memory ready.
Oct 10 16:33:44 localhost kernel: [    1.424173] [drm] GART: num cpu pages 262144, num gpu pages 262144
Oct 10 16:33:44 localhost kernel: [    1.424227] [drm] PCIE GART of 1024M enabled (table at 0x000000F4001D5000).
Oct 10 16:33:44 localhost kernel: [    1.424307] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_pfp.bin
Oct 10 16:33:44 localhost kernel: [    1.424318] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_me.bin
Oct 10 16:33:44 localhost kernel: [    1.424328] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_ce.bin
Oct 10 16:33:44 localhost kernel: [    1.424329] [drm] Chained IB support enabled!
Oct 10 16:33:44 localhost kernel: [    1.424340] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_rlc.bin
Oct 10 16:33:44 localhost kernel: [    1.424404] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_mec.bin
Oct 10 16:33:44 localhost kernel: [    1.424450] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_mec2.bin
Oct 10 16:33:44 localhost kernel: [    1.425016] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_sdma.bin
Oct 10 16:33:44 localhost kernel: [    1.425026] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_sdma1.bin
Oct 10 16:33:44 localhost kernel: [    1.425135] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_uvd.bin
Oct 10 16:33:44 localhost kernel: [    1.425136] [drm] Found UVD firmware Version: 1.91 Family ID: 12
Oct 10 16:33:44 localhost kernel: [    1.425138] [drm] UVD ENC is disabled
Oct 10 16:33:44 localhost kernel: [    1.425579] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_vce.bin
Oct 10 16:33:44 localhost kernel: [    1.425580] [drm] Found VCE firmware Version: 55.2 Binary ID: 3
Oct 10 16:33:44 localhost kernel: [    1.425851] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_smc.bin
...
Oct 10 16:33:44 localhost kernel: [    1.483252] [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
...
Oct 10 16:33:44 localhost kernel: [    1.496480] [drm] Display Core initialized with v3.2.27!
Oct 10 16:33:44 localhost kernel: [    1.524874] nvme nvme0: Shutdown timeout set to 8 seconds
Oct 10 16:33:44 localhost kernel: [    1.526321] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
Oct 10 16:33:44 localhost kernel: [    1.526322] [drm] Driver supports precise vblank timestamp query.
Oct 10 16:33:44 localhost kernel: [    1.547155] nvme nvme0: 32/0/0 default/read/poll queues
Oct 10 16:33:44 localhost kernel: [    1.552924] [drm] UVD initialized successfully.
Oct 10 16:33:44 localhost kernel: [    1.653044] [drm] VCE initialized successfully.
Oct 10 16:33:44 localhost kernel: [    1.654424] kfd kfd: Allocated 3969056 bytes on gart
Oct 10 16:33:44 localhost kernel: [    1.654436] Virtual CRAT table created for GPU
Oct 10 16:33:44 localhost kernel: [    1.654437] Parsing CRAT table with 1 nodes
Oct 10 16:33:44 localhost kernel: [    1.654448] Creating topology SYSFS entries
Oct 10 16:33:44 localhost kernel: [    1.654548] Topology: Add dGPU node [0x7300:0x1002]
Oct 10 16:33:44 localhost kernel: [    1.654640] kfd kfd: added device 1002:7300
Oct 10 16:33:44 localhost kernel: [    1.656602] [drm] fb mappable at 0x8086B000
Oct 10 16:33:44 localhost kernel: [    1.656603] [drm] vram apper at 0x80000000
Oct 10 16:33:44 localhost kernel: [    1.656603] [drm] size 16384000
Oct 10 16:33:44 localhost kernel: [    1.656603] [drm] fb depth is 24
Oct 10 16:33:44 localhost kernel: [    1.656604] [drm]    pitch is 10240
Oct 10 16:33:44 localhost kernel: [    1.656646] fbcon: amdgpudrmfb (fb0) is primary device
Oct 10 16:33:44 localhost kernel: [    1.678003] Console: switching to colour frame buffer device 320x100
Oct 10 16:33:44 localhost kernel: [    1.695638] amdgpu 0000:43:00.0: fb0: amdgpudrmfb frame buffer device
Oct 10 16:33:44 localhost kernel: [    1.704979] [drm] Initialized amdgpu 3.32.0 20150101 for 0000:43:00.0 on minor 0
Comment 6 Witold Baryluk 2019-10-13 05:14:06 UTC
lspci:

43:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev c8) (prog-if 00 [VGA controller])
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Radeon R9 FURY X / NANO
	Flags: bus master, fast devsel, latency 0, IRQ 65, NUMA node 1
	Memory at 80000000 (64-bit, prefetchable) [size=256M]
	Memory at 90000000 (64-bit, prefetchable) [size=2M]
	I/O ports at 8000 [size=256]
	Memory at 9f800000 (32-bit, non-prefetchable) [size=256K]
	Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Capabilities: [200] Resizable BAR <?>
	Capabilities: [270] Secondary PCI Express <?>
	Capabilities: [2b0] Address Translation Service (ATS)
	Capabilities: [2c0] Page Request Interface (PRI)
	Capabilities: [2d0] Process Address Space ID (PASID)
	Capabilities: [328] Alternative Routing-ID Interpretation (ARI)
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu

43:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Fiji HDMI/DP Audio [Radeon R9 Nano / FURY/FURY X]
	Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Fiji HDMI/DP Audio [Radeon R9 Nano / FURY/FURY X]
	Flags: bus master, fast devsel, latency 0, IRQ 164, NUMA node 1
	Memory at 9f860000 (64-bit, non-prefetchable) [size=16K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
	Capabilities: [58] Express Legacy Endpoint, MSI 00
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
	Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150] Advanced Error Reporting
	Capabilities: [328] Alternative Routing-ID Interpretation (ARI)
	Kernel driver in use: snd_hda_intel
	Kernel modules: snd_hda_intel
Comment 7 Witold Baryluk 2019-10-13 05:24:15 UTC
I also just reproduced it using Firefox with WebGL and ShaderToy, with a custom, reasonably complex shader to load the GPU. The frame rate is capped at 60 fps, so a reasonably complex shader was required to load the GPU heavily: I get 17 fps at the lows and 27 fps at the highs. The fps pattern is again periodic and repeating, despite the shader not depending on time (it constantly outputs a full-screen quad with the same pixel content and execution).

GPU temperature is constant 28 deg C even at full blast.

VDD changes from 1.25 to 1.14, 1.10, 1.05, 0.94, 0.90, and sometimes even to 0.85 V. After a while it jumps back to 1.25 V and the cycle repeats.
Comment 8 Witold Baryluk 2019-10-13 09:34:06 UTC
My random guess is that it is due to some bug in calculating gpu_busy_percent, or possibly in the SCLK_{UP,DOWN}_HYST parameters:

Here is a dump of various values, when running Talos:

Sun Oct 13 09:29:46 UTC 2019
/sys/class/drm/renderD128/device files:
gpu_busy_percent:5
pp_num_states:states: 2
pp_num_states:0 boot
pp_num_states:1 performance
pp_cur_state:1
pp_power_profile_mode:NUM        MODE_NAME     SCLK_UP_HYST   SCLK_DOWN_HYST SCLK_ACTIVE_LEVEL     MCLK_UP_HYST   MCLK_DOWN_HYST MCLK_ACTIVE_LEVEL
pp_power_profile_mode:  0   BOOTUP_DEFAULT:        -                -                -                -                -                -
pp_power_profile_mode:  1   3D_FULL_SCREEN:        0              100               30                0              100               10
pp_power_profile_mode:  2     POWER_SAVING:       10                0               30                -                -                -
pp_power_profile_mode:  3            VIDEO:        -                -                -               10               16               31
pp_power_profile_mode:  4               VR:        0               11               50                0              100               10
pp_power_profile_mode:  5        COMPUTE *:        0                5               30                0              100               10
pp_power_profile_mode:  6           CUSTOM:        -                -                -                -                -                -
vbios_version:113-C8800100-102
revision:0xc8
mem_info_gtt_total:4294967296
mem_info_gtt_used:364883968
mem_info_vis_vram_total:268435456
mem_info_vis_vram_used:50188288
mem_info_vram_total:4294967296
mem_info_vram_used:2290094080
power_dpm_force_performance_level:manual
power_dpm_state:performance
hwmon/hwmon0/freq1_label:sclk
hwmon/hwmon0/freq1_input:944000000
hwmon/hwmon0/freq2_label:mclk
hwmon/hwmon0/freq2_input:500000000
hwmon/hwmon0/temp1_input:31000
hwmon/hwmon0/power1_average:85182000
hwmon/hwmon0/fan1_input:3069
hwmon/hwmon0/pwm1:239
hwmon/hwmon0/in0_label:vddgfx
hwmon/hwmon0/in0_input:1100
pp_dpm_mclk:0: 500Mhz *
pp_dpm_sclk:0: 300Mhz 
pp_dpm_sclk:1: 512Mhz 
pp_dpm_sclk:2: 724Mhz 
pp_dpm_sclk:3: 892Mhz 
pp_dpm_sclk:4: 944Mhz *
pp_dpm_sclk:5: 984Mhz 
pp_dpm_sclk:6: 1018Mhz 
pp_dpm_sclk:7: 1050Mhz 
pp_mclk_od:0
pp_sclk_od:0
d3cold_allowed:1
power_dpm_force_performance_level:manual
/sys/class/drm/ttm/buffer_objects/bo_count:2997
/sys/kernel/debug/dri/0/clients:             command   pid dev master a   uid      magic
/sys/kernel/debug/dri/0/clients:                Xorg 118921   0   y    y     0          0
/sys/kernel/debug/dri/0/clients:                Xorg 118921 128   n    y     0          0
/sys/kernel/debug/dri/0/clients:                Xorg 118921 128   n    y     0          0
/sys/kernel/debug/dri/0/clients:                Xorg 118921 128   n    y     0          0
/sys/kernel/debug/dri/0/clients:                Xorg 118921 128   n    y     0          0
/sys/kernel/debug/dri/0/clients:                Xorg 118921 128   n    y     0          0
/sys/kernel/debug/dri/0/clients:                Xorg 118921 128   n    y     0          0
/sys/kernel/debug/dri/0/clients:                Xorg 118921 128   n    y     0          0
/sys/kernel/debug/dri/0/clients:                Xorg 118921 128   n    y     0          0
/sys/kernel/debug/dri/0/clients:                Xorg 118921 128   n    y     0          0
/sys/kernel/debug/dri/0/clients:                Xorg 118921 128   n    y     0          0
/sys/kernel/debug/dri/0/clients:               Talos 37263 128   n    n  1000          0


The default pp_power_profile_mode was 1 (3D_FULL_SCREEN). I switched it just now to 5 (COMPUTE), but results are the same.

The gpu_busy_percent switches every second between something close to 0 and something close to 100, despite the game constantly submitting work and rendering at ~200 FPS.

The pp_dpm_sclk level starts high and slowly goes down step by step, eventually jumps back to 7 (1050MHz), and the process repeats.

I tried doing:

echo "7" >  /sys/class/drm/card0/device/pp_dpm_sclk
echo "manual" > /sys/class/drm/card0/device/power_dpm_force_performance_level

and I see no difference; the effect is still the same.
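
To follow the sclk level stepping down over time, the pp_dpm_sclk output shown in the dump above can be parsed for the line marked with `*`. A minimal sketch, assuming the `index: <freq>Mhz [*]` line format seen in this report; the helper names are made up for illustration.

```python
import re

# Matches lines such as "4: 944Mhz *" as they appear in the dump above.
_LEVEL_RE = re.compile(r"^\s*(\d+):\s*(\d+)\s*MHz\s*(\*)?\s*$", re.IGNORECASE)


def parse_dpm_levels(text):
    """Return a list of (index, mhz, is_active) tuples from pp_dpm_sclk text."""
    levels = []
    for line in text.splitlines():
        match = _LEVEL_RE.match(line)
        if match:
            levels.append(
                (int(match.group(1)), int(match.group(2)), match.group(3) is not None)
            )
    return levels


def active_level(text):
    """Return (index, mhz) of the level marked with '*', or None."""
    for index, mhz, is_active in parse_dpm_levels(text):
        if is_active:
            return index, mhz
    return None
```

Calling `active_level()` on repeated reads of the file and logging the result with a timestamp would show how long each DPM level is held before the next step down.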
Comment 9 Witold Baryluk 2019-10-13 10:08:13 UTC
Kernel config, nothing special:

$ grep DRM_AMD /boot/config-5.2.0-3-amd64 
CONFIG_DRM_AMDGPU=m
CONFIG_DRM_AMDGPU_SI=y
CONFIG_DRM_AMDGPU_CIK=y
CONFIG_DRM_AMDGPU_USERPTR=y
# CONFIG_DRM_AMDGPU_GART_DEBUGFS is not set
CONFIG_DRM_AMD_ACP=y
CONFIG_DRM_AMD_DC=y
CONFIG_DRM_AMD_DC_DCN1_0=y
CONFIG_DRM_AMD_DC_DCN1_01=y
$
Comment 10 Witold Baryluk 2019-10-13 10:25:16 UTC
Firmware signatures:

user@debian:/lib/firmware/amdgpu$ sha256sum fiji_*
615693b2736f13c4ef3cd9220efe4d55df3c5d82fe128d3f1b34a45edba65fbd  fiji_ce.bin
b0d51dc0b361afa07bcefa0f4670c344679b1fcbe1be68c06e727eaaf0098236  fiji_mc.bin
953747f5b93bd743bb75747b950be3e4ccbe481ac1f7110a58d399ac840f158a  fiji_me.bin
cd1133103874ce368c4f46eeb38fe293caad5f77e4fee8567f6f6be9c47687c4  fiji_mec2.bin
cd1133103874ce368c4f46eeb38fe293caad5f77e4fee8567f6f6be9c47687c4  fiji_mec.bin
91bda514a4d0d846d48321aa4d3c92ff1049fe53cbf3e007686553a29a9018de  fiji_pfp.bin
f0fa903f16502cff35dc073a77c1ef382f4218ec2928f23a173400888f90400d  fiji_rlc.bin
1c5ab71e854cc59e4998559ed07c436d05b2a97b0df0a51a3924b1c240398949  fiji_sdma1.bin
b5cf6b3a3b7e6839a68a92ac8651d53d0ae41e1caee28c68155b2d7865f1cf4c  fiji_sdma.bin
fd13fe6b32cef9129f1b75f46b014babcf4075ebc8a715bf19da573be8b68223  fiji_smc.bin
b7401cfda1087ee5cf71acef19163f311c71775802b331ca82b9177119e4d97b  fiji_uvd.bin
0fe1a2e4e2e4f6f8d5600d8c13cb60a8bc87089cd37c766fef2f95ccd5e277ac  fiji_vce.bin
user@debian:/lib/firmware/amdgpu$
Comment 11 Alex Deucher 2019-10-15 20:54:57 UTC
The GPU dynamically adjusts the clocks and voltages at runtime based on the load on the particular engines or hw subsystems.  You can use the pp_power_profile_mode interface to adjust the heuristics that determine how the GPU transitions through power states.
Comment 12 Witold Baryluk 2019-10-16 12:44:33 UTC
Hi Alex.

I do understand that; it is a part of power management. That is not what the bug is about.

I did use pp_power_profile_mode too, and it doesn't really help. The issue is that I would expect the performance to stabilize and the frequency and voltages to converge to satisfy the load, but they don't. The workload this is happening with isn't GPU limited (it starts with a GPU load of about 40%), so it is not fully representative of other workloads, but the frequency transitions look suboptimal.

Is it possible to set custom SCLK hysteresis maybe?

As I said, I tried setting:


echo "7" >  /sys/class/drm/card0/device/pp_dpm_sclk
echo "manual" > /sys/class/drm/card0/device/power_dpm_force_performance_level

to confine sclk to a single level, but it didn't help. I tried changing pp_power_profile_mode to COMPUTE (just to see what happens), and there was no difference in the observed behaviour.

I don't have amdgpu.ppfeaturemask set, though, so maybe the kernel driver is simply ignoring my requests.
Comment 13 deathlock13 2019-10-17 11:51:44 UTC
Did you 'cat pp_dpm_sclk' after 'echo 7 > ...'?
I have to boot with amdgpu.runpm=0 to be able to change anything.
The drawback is that the notebook fan goes off every 2-3 min, for 30 sec, at idle :/
Comment 14 Alex Deucher 2019-10-17 13:06:57 UTC
You can force the clocks low or high by:
echo low > power_dpm_force_performance_level
or
echo high > power_dpm_force_performance_level
Setting it to auto will restore the automatic behavior:
echo auto > power_dpm_force_performance_level

The behavior will depend on the workload.  If the workload is really bursty, it may cause the clocks to ramp up and down if there are sufficiently long idle periods between workloads.  You can manually adjust the heuristics by selecting the custom profile and tweaking each parameter.  See the documentation here:
https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html#gpu-power-thermal-controls-and-monitoring
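
When scripting the commands above, it may help to read the value back after writing it, to confirm the driver accepted the request. A minimal sketch under stated assumptions: the helper name is hypothetical, `device_dir` would be something like /sys/class/drm/card0/device, the list of levels is limited to the ones mentioned in this thread, and writing to the real file requires root.

```python
from pathlib import Path

# Only the levels mentioned in this thread; the kernel documentation
# linked above lists the full set.
KNOWN_LEVELS = {"auto", "low", "high", "manual", "profile_peak"}


def set_performance_level(device_dir, level):
    """Write `level` to power_dpm_force_performance_level and verify it.

    Returns True if the value read back matches what was written.
    """
    if level not in KNOWN_LEVELS:
        raise ValueError(f"unexpected level: {level!r}")
    node = Path(device_dir) / "power_dpm_force_performance_level"
    node.write_text(level + "\n")
    return node.read_text().strip() == level
```

A `False` return would mean the driver kept a different level than the one requested.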
Comment 15 Witold Baryluk 2019-10-17 18:12:30 UTC
Setting

echo high > power_dpm_force_performance_level

didn't help. It is set to high (verified by reading back from sysfs).

gpu_busy_percent is showing me 100 all the time, and sometimes jumps a little lower, maybe 87, 95, then back to 100.

Despite all of this the GPU frequency is slowly and gradually dropping down, until it reaches the lowest frequency and then jumps back to maximum.

It does look like a bug to me...
Comment 16 Alex Deucher 2019-10-17 18:19:46 UTC
It sounds like the GPU is getting throttled due to power or temperature.  Is the cooling solution on your GPU clean and working properly?  Do you have an adequate power supply?
Comment 17 Witold Baryluk 2019-10-17 18:41:24 UTC
I also tried echo profile_peak > power_dpm_force_performance_level. Initially the sclk stays at the highest level, but after a minute it drops just like with the other settings.

As for cooling and the PSU: I am sure I have plenty of headroom.

The hwmon reports ~207W power at highest clock state, but temperature as reported by hwmon stays at 39 deg C. At lower clock states it drops to 37 deg C, and to about 35 deg C at idle.

Looks to be just fine to me.

My PC case does have extra ventilation, and the CPU is pretty much idle.

My PSU is Seasonic Prime Titanium (1000W). I have:

Motherboard: MSI MEG Creation X399
CPU: AMD Threadripper 2950X at stock, liquid cooled, idle (~5% load), about 46 deg C.
RAM: 8x16GB Samsung DDR4 ECC
GPU: AMD Radeon Fury X, water cooled
NIC: Mellanox ConnectX-3
Storage: 2x Samsung PM983 via U.2
Cooling: Plenty of fans in Fractal Define R6 USB-C case.

I just opened the case, and verified that two 8-pin PCI-E connectors are connected to GPU and PSU. I also verified that extra (optional, even on multi GPU systems) power connectors to MB are also connected to MB and PSU.

GPU radiator is above the GPU, and is not hot to the touch or anything. Fan on it is spinning at max.
Comment 18 Martin Peres 2019-11-19 09:58:03 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/936.
