AMD Radeon Fury X.

Linux debian 5.2.0-3-amd64 #1 SMP Debian 5.2.17-1 (2019-09-26) x86_64 GNU/Linux

ii xserver-xorg-video-radeon 1:19.0.1-1 amd64 X.Org X server -- AMD/ATI Radeon display driver
ii xserver-xorg-video-amdgpu 19.0.1-1 amd64 X.Org X server -- AMDGPU display driver
ii libdrm-radeon1:amd64 2.4.99-1 amd64 Userspace interface to radeon-specific kernel DRM services -- runtime
ii libdrm-amdgpu1:amd64 2.4.99-1 amd64 Userspace interface to amdgpu-specific kernel DRM services -- runtime

I was able to reproduce the issue in a few titles:

Overwatch (64-bit Windows game) with various Wine and DXVK versions, as well as with the Wine OpenGL renderer.
Talos (native 64-bit Linux game) with the Vulkan renderer.

Tested with both Mesa 19.2.1-1 with LLVM 9 from Debian, and a custom-compiled Mesa 19.3.0-devel with LLVM 10 and ACO backend compilers.

If I set up the game to constantly render the same thing on screen (I do that by simply going to a corner of the map and looking at the ground or a corner, where there is a minimal amount of geometry and variability), I initially get a very high and stable frame rate of, let's say, 105 FPS (plus or minus 1 FPS). However, if I wait long enough, there are periodic (not sporadic, but actually periodic and exactly repeatable) situations where the FPS drops. During those periods the GPU load increases from 30% to 100%, sometimes with one or two intermediate steps (depending on the game and setup). I also notice that the GPU VDD changes during these periods.

I eliminated all other sources of variability. Nothing is running in the background. The reported GPU temperature is below 32 deg C, and stays stable and flat during testing.

Sometimes, if I keep the game running long enough, it will stabilize and stop doing that. But sometimes, if I wait long enough, it will fall back into this behaviour. Most of the time the behaviour is extremely repetitive and predictable. Not random.
Please see the attached frametime graphs (captured with a modified Mesa Vulkan overlay) for Talos and Overwatch.
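One way to put a number on the "periodic, not sporadic" observation is to autocorrelate the captured frametimes. The following is only a minimal sketch: `dominant_period` is a hypothetical helper (not part of Mesa or the overlay), and it assumes the frametimes have been exported as a plain list of milliseconds. The lag it reports is counted in frames, not seconds.

```python
def dominant_period(frametimes_ms, min_lag=1):
    """Return (lag, score) for the lag with the highest autocorrelation.

    lag is in frames; score is the normalized autocorrelation at that lag
    (1.0 would mean a perfectly repeating signal). Returns (None, 0.0) if
    the series is constant or no lag has a positive score.
    """
    n = len(frametimes_ms)
    mean = sum(frametimes_ms) / n
    centered = [s - mean for s in frametimes_ms]
    variance = sum(c * c for c in centered)
    if variance == 0:
        return None, 0.0

    best_lag, best_score = None, 0.0
    # Only lags up to half the capture length have enough overlap to be useful.
    for lag in range(min_lag, n // 2):
        overlap = sum(centered[i] * centered[i + lag] for i in range(n - lag))
        score = overlap / variance
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag, best_score
```

On a real capture the shortest lags tend to score high simply because neighbouring frames are similar, so `min_lag` would probably need to be raised well above 1 to pick out the slow cycle described above.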
Created attachment 145723: Frametimes during execution of The Talos Principle (64-bit) while looking at the ground/wall
Created attachment 145724: Frametimes during run of Overwatch (Wine+DXVK) and OBS in background (not recording or even previewing!).
I initially blamed OBS (Open Broadcasting Studio) for the problem. But I was able to reproduce the issue even with OBS recording, previewing, grabbing frame, or even it running. So I am almost sure it is a hardware or power management issue in the kernel driver.

Unfortunately I am not quite able to capture timelines of /sys/kernel/debug/dri/0/amdgpu_pm_info correlated with the frametimes, as reading this file very often blocks all rendering for a few milliseconds, skewing the results (frametime spikes). However, I can tell that the temperature reported there is constantly below 32 deg C, and the frequency looks the same all the time, at least for the GPU core. The voltage and GPU load reported there are all over the place, sometimes reporting 0% GPU load despite the game running in windowed mode and producing 200 FPS, which should be at least ~30% GPU load judging from my other measurements.

I also confirmed the FPS / frametime issues with 3 other independent methods (DXVK_HUD=fps, GALLIUM_HUD=fps, in-game FPS / frametime counters). But the main one is the modified Mesa overlay.
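A possible alternative to polling amdgpu_pm_info is to sample a few individual sysfs files and stamp each sample with a monotonic clock, so the log can later be lined up against frametimes. This is a rough sketch under assumptions: the file names are the ones that appear in the sysfs dump elsewhere in this report, `sample` and `log_samples` are hypothetical helpers, and whether reading these files avoids the rendering stall has not been verified here.

```python
import time
from pathlib import Path

# Files relative to the device directory (e.g. /sys/class/drm/card0/device).
# The names are taken from the dump in this report; adjust hwmon0 as needed.
FILES = (
    "gpu_busy_percent",
    "hwmon/hwmon0/freq1_input",  # sclk in Hz
    "hwmon/hwmon0/in0_input",    # vddgfx in mV
)


def sample(device_dir, files=FILES):
    """Read each file once and return a dict keyed by file name.

    "t" holds time.monotonic() taken just before the reads. Files that are
    missing or do not contain an integer are recorded as None.
    """
    row = {"t": time.monotonic()}
    for name in files:
        try:
            row[name] = int((Path(device_dir) / name).read_text().strip())
        except (OSError, ValueError):
            row[name] = None
    return row


def log_samples(device_dir, count, interval_s=0.1, files=FILES):
    """Collect `count` samples, sleeping `interval_s` between them."""
    rows = []
    for _ in range(count):
        rows.append(sample(device_dir, files))
        time.sleep(interval_s)
    return rows
```

Something like `log_samples("/sys/class/drm/card0/device", count=600)` would cover about a minute. Correlating with frametimes only works if the overlay's timestamps can be mapped onto the same clock, which is an assumption here.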
> I initially blamed OBS (Open Broadcasting Studio) for the problem. But I was able to reproduce the issue even with OBS recording, previewing, grabbing frame, or even it running. s/with/WITHOUT/.
# cat /sys/kernel/debug/dri/0/amdgpu_firmware_info
VCE feature version: 0, firmware version: 0x37020300
UVD feature version: 0, firmware version: 0x015b0c00
MC feature version: 0, firmware version: 0x00000000
ME feature version: 49, firmware version: 0x000000a7
PFP feature version: 49, firmware version: 0x000000fd
CE feature version: 49, firmware version: 0x0000008c
RLC feature version: 1, firmware version: 0x000000d6
RLC SRLC feature version: 0, firmware version: 0x00000000
RLC SRLG feature version: 0, firmware version: 0x00000000
RLC SRLS feature version: 0, firmware version: 0x00000000
MEC feature version: 49, firmware version: 0x000002d9
MEC2 feature version: 49, firmware version: 0x000002d9
SOS feature version: 0, firmware version: 0x00000000
ASD feature version: 0, firmware version: 0x00000000
TA XGMI feature version: 0, firmware version: 0x00000000
TA RAS feature version: 0, firmware version: 0x00000000
SMC feature version: 0, firmware version: 0x00161000
SDMA0 feature version: 31, firmware version: 0x00000022
SDMA1 feature version: 0, firmware version: 0x00000022
VCN feature version: 0, firmware version: 0x00000000
DMCU feature version: 0, firmware version: 0x00000000
VBIOS version: 113-C8800100-102
#

ii firmware-amd-graphics 20190717-2 all Binary firmware for AMD/ATI graphics chips

Some pieces from dmesg:

Oct 10 16:33:44 localhost kernel: [ 1.421938] [drm] amdgpu kernel modesetting enabled.
Oct 10 16:33:44 localhost kernel: [ 1.421996] Parsing CRAT table with 2 nodes
Oct 10 16:33:44 localhost kernel: [ 1.422003] Ignoring ACPI CRAT on non-APU system
Oct 10 16:33:44 localhost kernel: [ 1.422005] Virtual CRAT table created for CPU
Oct 10 16:33:44 localhost kernel: [ 1.422006] Parsing CRAT table with 2 nodes
Oct 10 16:33:44 localhost kernel: [ 1.422007] Creating topology SYSFS entries
Oct 10 16:33:44 localhost kernel: [ 1.422020] Topology: Add CPU node
Oct 10 16:33:44 localhost kernel: [ 1.422020] Finished initializing topology
Oct 10 16:33:44 localhost kernel: [ 1.422178] amdgpu 0000:43:00.0: remove_conflicting_pci_framebuffers: bar 0: 0x80000000 -> 0x8fffffff
Oct 10 16:33:44 localhost kernel: [ 1.422180] amdgpu 0000:43:00.0: remove_conflicting_pci_framebuffers: bar 2: 0x90000000 -> 0x901fffff
Oct 10 16:33:44 localhost kernel: [ 1.422181] amdgpu 0000:43:00.0: remove_conflicting_pci_framebuffers: bar 5: 0x9f800000 -> 0x9f83ffff
Oct 10 16:33:44 localhost kernel: [ 1.422183] checking generic (80000000 1f0000) vs hw (80000000 10000000)
Oct 10 16:33:44 localhost kernel: [ 1.422184] fb0: switching to amdgpudrmfb from EFI VGA
Oct 10 16:33:44 localhost kernel: [ 1.422209] Console: switching to colour dummy device 80x25
Oct 10 16:33:44 localhost kernel: [ 1.422284] amdgpu 0000:43:00.0: vgaarb: deactivate vga console
Oct 10 16:33:44 localhost kernel: [ 1.422549] [drm] initializing kernel modesetting (FIJI 0x1002:0x7300 0x1002:0x0B36 0xC8).
Oct 10 16:33:44 localhost kernel: [ 1.422563] [drm] register mmio base: 0x9F800000
Oct 10 16:33:44 localhost kernel: [ 1.422563] [drm] register mmio size: 262144
Oct 10 16:33:44 localhost kernel: [ 1.422573] [drm] add ip block number 0 <vi_common>
Oct 10 16:33:44 localhost kernel: [ 1.422573] [drm] add ip block number 1 <gmc_v8_0>
Oct 10 16:33:44 localhost kernel: [ 1.422574] [drm] add ip block number 2 <tonga_ih>
Oct 10 16:33:44 localhost kernel: [ 1.422575] [drm] add ip block number 3 <gfx_v8_0>
Oct 10 16:33:44 localhost kernel: [ 1.422576] [drm] add ip block number 4 <sdma_v3_0>
Oct 10 16:33:44 localhost kernel: [ 1.422577] [drm] add ip block number 5 <powerplay>
Oct 10 16:33:44 localhost kernel: [ 1.422578] [drm] add ip block number 6 <dm>
Oct 10 16:33:44 localhost kernel: [ 1.422579] [drm] add ip block number 7 <uvd_v6_0>
Oct 10 16:33:44 localhost kernel: [ 1.422579] [drm] add ip block number 8 <vce_v3_0>
Oct 10 16:33:44 localhost kernel: [ 1.422594] [drm] UVD is enabled in physical mode
Oct 10 16:33:44 localhost kernel: [ 1.422595] [drm] VCE enabled in physical mode
Oct 10 16:33:44 localhost kernel: [ 1.424046] ATOM BIOS: 113-C8800100-102
Oct 10 16:33:44 localhost kernel: [ 1.424070] [drm] RAS INFO: ras initialized successfully, hardware ability[0] ras_mask[0]
Oct 10 16:33:44 localhost kernel: [ 1.424073] [drm] vm size is 512 GB, 2 levels, block size is 10-bit, fragment size is 9-bit
Oct 10 16:33:44 localhost kernel: [ 1.424080] amdgpu 0000:43:00.0: VRAM: 4096M 0x000000F400000000 - 0x000000F4FFFFFFFF (4096M used)
Oct 10 16:33:44 localhost kernel: [ 1.424081] amdgpu 0000:43:00.0: GART: 1024M 0x000000FF00000000 - 0x000000FF3FFFFFFF
Oct 10 16:33:44 localhost kernel: [ 1.424085] [drm] Detected VRAM RAM=4096M, BAR=256M
Oct 10 16:33:44 localhost kernel: [ 1.424086] [drm] RAM width 512bits HBM
Oct 10 16:33:44 localhost kernel: [ 1.424126] [TTM] Zone kernel: Available graphics memory: 65980746 KiB
Oct 10 16:33:44 localhost kernel: [ 1.424126] [TTM] Zone dma32: Available graphics memory: 2097152 KiB
Oct 10 16:33:44 localhost kernel: [ 1.424127] [TTM] Initializing pool allocator
Oct 10 16:33:44 localhost kernel: [ 1.424129] [TTM] Initializing DMA pool allocator
Oct 10 16:33:44 localhost kernel: [ 1.424158] [drm] amdgpu: 4096M of VRAM memory ready
Oct 10 16:33:44 localhost kernel: [ 1.424160] [drm] amdgpu: 4096M of GTT memory ready.
Oct 10 16:33:44 localhost kernel: [ 1.424173] [drm] GART: num cpu pages 262144, num gpu pages 262144
Oct 10 16:33:44 localhost kernel: [ 1.424227] [drm] PCIE GART of 1024M enabled (table at 0x000000F4001D5000).
Oct 10 16:33:44 localhost kernel: [ 1.424307] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_pfp.bin
Oct 10 16:33:44 localhost kernel: [ 1.424318] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_me.bin
Oct 10 16:33:44 localhost kernel: [ 1.424328] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_ce.bin
Oct 10 16:33:44 localhost kernel: [ 1.424329] [drm] Chained IB support enabled!
Oct 10 16:33:44 localhost kernel: [ 1.424340] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_rlc.bin
Oct 10 16:33:44 localhost kernel: [ 1.424404] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_mec.bin
Oct 10 16:33:44 localhost kernel: [ 1.424450] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_mec2.bin
Oct 10 16:33:44 localhost kernel: [ 1.425016] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_sdma.bin
Oct 10 16:33:44 localhost kernel: [ 1.425026] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_sdma1.bin
Oct 10 16:33:44 localhost kernel: [ 1.425135] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_uvd.bin
Oct 10 16:33:44 localhost kernel: [ 1.425136] [drm] Found UVD firmware Version: 1.91 Family ID: 12
Oct 10 16:33:44 localhost kernel: [ 1.425138] [drm] UVD ENC is disabled
Oct 10 16:33:44 localhost kernel: [ 1.425579] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_vce.bin
Oct 10 16:33:44 localhost kernel: [ 1.425580] [drm] Found VCE firmware Version: 55.2 Binary ID: 3
Oct 10 16:33:44 localhost kernel: [ 1.425851] amdgpu 0000:43:00.0: firmware: direct-loading firmware amdgpu/fiji_smc.bin
...
Oct 10 16:33:44 localhost kernel: [ 1.483252] [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
...
Oct 10 16:33:44 localhost kernel: [ 1.496480] [drm] Display Core initialized with v3.2.27!
Oct 10 16:33:44 localhost kernel: [ 1.524874] nvme nvme0: Shutdown timeout set to 8 seconds
Oct 10 16:33:44 localhost kernel: [ 1.526321] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
Oct 10 16:33:44 localhost kernel: [ 1.526322] [drm] Driver supports precise vblank timestamp query.
Oct 10 16:33:44 localhost kernel: [ 1.547155] nvme nvme0: 32/0/0 default/read/poll queues
Oct 10 16:33:44 localhost kernel: [ 1.552924] [drm] UVD initialized successfully.
Oct 10 16:33:44 localhost kernel: [ 1.653044] [drm] VCE initialized successfully.
Oct 10 16:33:44 localhost kernel: [ 1.654424] kfd kfd: Allocated 3969056 bytes on gart
Oct 10 16:33:44 localhost kernel: [ 1.654436] Virtual CRAT table created for GPU
Oct 10 16:33:44 localhost kernel: [ 1.654437] Parsing CRAT table with 1 nodes
Oct 10 16:33:44 localhost kernel: [ 1.654448] Creating topology SYSFS entries
Oct 10 16:33:44 localhost kernel: [ 1.654548] Topology: Add dGPU node [0x7300:0x1002]
Oct 10 16:33:44 localhost kernel: [ 1.654640] kfd kfd: added device 1002:7300
Oct 10 16:33:44 localhost kernel: [ 1.656602] [drm] fb mappable at 0x8086B000
Oct 10 16:33:44 localhost kernel: [ 1.656603] [drm] vram apper at 0x80000000
Oct 10 16:33:44 localhost kernel: [ 1.656603] [drm] size 16384000
Oct 10 16:33:44 localhost kernel: [ 1.656603] [drm] fb depth is 24
Oct 10 16:33:44 localhost kernel: [ 1.656604] [drm] pitch is 10240
Oct 10 16:33:44 localhost kernel: [ 1.656646] fbcon: amdgpudrmfb (fb0) is primary device
Oct 10 16:33:44 localhost kernel: [ 1.678003] Console: switching to colour frame buffer device 320x100
Oct 10 16:33:44 localhost kernel: [ 1.695638] amdgpu 0000:43:00.0: fb0: amdgpudrmfb frame buffer device
Oct 10 16:33:44 localhost kernel: [ 1.704979] [drm] Initialized amdgpu 3.32.0 20150101 for 0000:43:00.0 on minor 0
lspci:

43:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Fiji [Radeon R9 FURY / NANO Series] (rev c8) (prog-if 00 [VGA controller])
    Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Radeon R9 FURY X / NANO
    Flags: bus master, fast devsel, latency 0, IRQ 65, NUMA node 1
    Memory at 80000000 (64-bit, prefetchable) [size=256M]
    Memory at 90000000 (64-bit, prefetchable) [size=2M]
    I/O ports at 8000 [size=256]
    Memory at 9f800000 (32-bit, non-prefetchable) [size=256K]
    Expansion ROM at 000c0000 [disabled] [size=128K]
    Capabilities: [48] Vendor Specific Information: Len=08 <?>
    Capabilities: [50] Power Management version 3
    Capabilities: [58] Express Legacy Endpoint, MSI 00
    Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Capabilities: [150] Advanced Error Reporting
    Capabilities: [200] Resizable BAR <?>
    Capabilities: [270] Secondary PCI Express <?>
    Capabilities: [2b0] Address Translation Service (ATS)
    Capabilities: [2c0] Page Request Interface (PRI)
    Capabilities: [2d0] Process Address Space ID (PASID)
    Capabilities: [328] Alternative Routing-ID Interpretation (ARI)
    Kernel driver in use: amdgpu
    Kernel modules: amdgpu

43:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Fiji HDMI/DP Audio [Radeon R9 Nano / FURY/FURY X]
    Subsystem: Advanced Micro Devices, Inc. [AMD/ATI] Fiji HDMI/DP Audio [Radeon R9 Nano / FURY/FURY X]
    Flags: bus master, fast devsel, latency 0, IRQ 164, NUMA node 1
    Memory at 9f860000 (64-bit, non-prefetchable) [size=16K]
    Capabilities: [48] Vendor Specific Information: Len=08 <?>
    Capabilities: [50] Power Management version 3
    Capabilities: [58] Express Legacy Endpoint, MSI 00
    Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
    Capabilities: [100] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Capabilities: [150] Advanced Error Reporting
    Capabilities: [328] Alternative Routing-ID Interpretation (ARI)
    Kernel driver in use: snd_hda_intel
    Kernel modules: snd_hda_intel
I also just reproduced it using Firefox with WebGL and ShaderToy, with a custom, reasonably complex shader to load the GPU. The frame rate there is capped at 60 fps, so it took a reasonably complex shader to load the GPU heavily enough to see the effect: 17 fps at the lows, and 27 fps at the highs. The fps pattern is again periodic and repeating, despite the shader not depending on time (it constantly outputs a full-screen quad with the same pixel content and execution).

GPU temperature is a constant 28 deg C even at full blast. VDD changes from 1.25 V to 1.14, 1.10, 1.05, 0.94, 0.90, and sometimes even 0.85 V. After a while it jumps back to 1.25 V and the cycle repeats.
My random guess is that it is due to some bug in calculating gpu_busy_percent, or possibly in the SCLK_{UP,DOWN}_HYST parameters.

Here is a dump of various values while running Talos:

Sun Oct 13 09:29:46 UTC 2019
/sys/class/drm/renderD128/device files:
gpu_busy_percent:5
pp_num_states:states: 2
pp_num_states:0 boot
pp_num_states:1 performance
pp_cur_state:1
pp_power_profile_mode:NUM MODE_NAME SCLK_UP_HYST SCLK_DOWN_HYST SCLK_ACTIVE_LEVEL MCLK_UP_HYST MCLK_DOWN_HYST MCLK_ACTIVE_LEVEL
pp_power_profile_mode: 0 BOOTUP_DEFAULT: - - - - - -
pp_power_profile_mode: 1 3D_FULL_SCREEN: 0 100 30 0 100 10
pp_power_profile_mode: 2 POWER_SAVING: 10 0 30 - - -
pp_power_profile_mode: 3 VIDEO: - - - 10 16 31
pp_power_profile_mode: 4 VR: 0 11 50 0 100 10
pp_power_profile_mode: 5 COMPUTE *: 0 5 30 0 100 10
pp_power_profile_mode: 6 CUSTOM: - - - - - -
vbios_version:113-C8800100-102
revision:0xc8
mem_info_gtt_total:4294967296
mem_info_gtt_used:364883968
mem_info_vis_vram_total:268435456
mem_info_vis_vram_used:50188288
mem_info_vram_total:4294967296
mem_info_vram_used:2290094080
power_dpm_force_performance_level:manual
power_dpm_state:performance
hwmon/hwmon0/freq1_label:sclk
hwmon/hwmon0/freq1_input:944000000
hwmon/hwmon0/freq2_label:mclk
hwmon/hwmon0/freq2_input:500000000
hwmon/hwmon0/temp1_input:31000
hwmon/hwmon0/power1_average:85182000
hwmon/hwmon0/fan1_input:3069
hwmon/hwmon0/pwm1:239
hwmon/hwmon0/in0_label:vddgfx
hwmon/hwmon0/in0_input:1100
pp_dpm_mclk:0: 500Mhz *
pp_dpm_sclk:0: 300Mhz
pp_dpm_sclk:1: 512Mhz
pp_dpm_sclk:2: 724Mhz
pp_dpm_sclk:3: 892Mhz
pp_dpm_sclk:4: 944Mhz *
pp_dpm_sclk:5: 984Mhz
pp_dpm_sclk:6: 1018Mhz
pp_dpm_sclk:7: 1050Mhz
pp_mclk_od:0
pp_sclk_od:0
d3cold_allowed:1
power_dpm_force_performance_level:manual
/sys/class/drm/ttm/buffer_objects/bo_count:2997
/sys/kernel/debug/dri/0/clients: command pid dev master a uid magic
/sys/kernel/debug/dri/0/clients: Xorg 118921 0 y y 0 0
/sys/kernel/debug/dri/0/clients: Xorg 118921 128 n y 0 0
/sys/kernel/debug/dri/0/clients: Xorg 118921 128 n y 0 0
/sys/kernel/debug/dri/0/clients: Xorg 118921 128 n y 0 0
/sys/kernel/debug/dri/0/clients: Xorg 118921 128 n y 0 0
/sys/kernel/debug/dri/0/clients: Xorg 118921 128 n y 0 0
/sys/kernel/debug/dri/0/clients: Xorg 118921 128 n y 0 0
/sys/kernel/debug/dri/0/clients: Xorg 118921 128 n y 0 0
/sys/kernel/debug/dri/0/clients: Xorg 118921 128 n y 0 0
/sys/kernel/debug/dri/0/clients: Xorg 118921 128 n y 0 0
/sys/kernel/debug/dri/0/clients: Xorg 118921 128 n y 0 0
/sys/kernel/debug/dri/0/clients: Talos 37263 128 n n 1000 0

The default pp_power_profile_mode was 1 (3D_FULL_SCREEN). I switched it just now to 5 (COMPUTE), but the results are the same.

gpu_busy_percent switches every second between something close to 0 and something close to 100, despite the game constantly submitting work and rendering at ~200 FPS.

pp_dpm_sclk starts high and slowly goes down step by step, eventually jumps back to 7 (1050MHz), and the process repeats.

I tried doing:

echo "7" > /sys/class/drm/card0/device/pp_dpm_sclk
echo "manual" > /sys/class/drm/card0/device/power_dpm_force_performance_level

and I see no difference, still the same effect.
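To follow the step-by-step sclk descent without reading it off by eye, the pp_dpm_sclk text can be parsed for the level marked with an asterisk. This is a small sketch that assumes the raw file format shown in the dump (one "index: frequency" pair per line, `*` on the active level); `parse_dpm_levels` and `active_level` are hypothetical names, not part of any amdgpu tooling.

```python
import re

# One level per line, e.g. "4: 944Mhz *"; the trailing "*" marks the active level.
_LEVEL = re.compile(r"\s*(\d+):\s*(\d+)\s*mhz\s*(\*?)", re.IGNORECASE)


def parse_dpm_levels(text):
    """Return a list of (index, mhz, is_active) tuples from pp_dpm_sclk text."""
    levels = []
    for line in text.splitlines():
        match = _LEVEL.match(line)
        if match:
            index, mhz, star = match.groups()
            levels.append((int(index), int(mhz), star == "*"))
    return levels


def active_level(text):
    """Return (index, mhz) of the active level, or None if none is marked."""
    for index, mhz, is_active in parse_dpm_levels(text):
        if is_active:
            return index, mhz
    return None
```

Called once per sample on the contents of pp_dpm_sclk, `active_level` yields a sequence such as 7, 6, 5, ... that can be plotted next to the frametimes.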
Kernel config, nothing special:

$ grep DRM_AMD /boot/config-5.2.0-3-amd64
CONFIG_DRM_AMDGPU=m
CONFIG_DRM_AMDGPU_SI=y
CONFIG_DRM_AMDGPU_CIK=y
CONFIG_DRM_AMDGPU_USERPTR=y
# CONFIG_DRM_AMDGPU_GART_DEBUGFS is not set
CONFIG_DRM_AMD_ACP=y
CONFIG_DRM_AMD_DC=y
CONFIG_DRM_AMD_DC_DCN1_0=y
CONFIG_DRM_AMD_DC_DCN1_01=y
$
Firmware signatures:

user@debian:/lib/firmware/amdgpu$ sha256sum fiji_*
615693b2736f13c4ef3cd9220efe4d55df3c5d82fe128d3f1b34a45edba65fbd fiji_ce.bin
b0d51dc0b361afa07bcefa0f4670c344679b1fcbe1be68c06e727eaaf0098236 fiji_mc.bin
953747f5b93bd743bb75747b950be3e4ccbe481ac1f7110a58d399ac840f158a fiji_me.bin
cd1133103874ce368c4f46eeb38fe293caad5f77e4fee8567f6f6be9c47687c4 fiji_mec2.bin
cd1133103874ce368c4f46eeb38fe293caad5f77e4fee8567f6f6be9c47687c4 fiji_mec.bin
91bda514a4d0d846d48321aa4d3c92ff1049fe53cbf3e007686553a29a9018de fiji_pfp.bin
f0fa903f16502cff35dc073a77c1ef382f4218ec2928f23a173400888f90400d fiji_rlc.bin
1c5ab71e854cc59e4998559ed07c436d05b2a97b0df0a51a3924b1c240398949 fiji_sdma1.bin
b5cf6b3a3b7e6839a68a92ac8651d53d0ae41e1caee28c68155b2d7865f1cf4c fiji_sdma.bin
fd13fe6b32cef9129f1b75f46b014babcf4075ebc8a715bf19da573be8b68223 fiji_smc.bin
b7401cfda1087ee5cf71acef19163f311c71775802b331ca82b9177119e4d97b fiji_uvd.bin
0fe1a2e4e2e4f6f8d5600d8c13cb60a8bc87089cd37c766fef2f95ccd5e277ac fiji_vce.bin
user@debian:/lib/firmware/amdgpu$
The GPU dynamically adjusts the clocks and voltages at runtime based on the load on the particular engines or hw subsystems. You can use the pp_power_profile_mode interface to adjust the heuristics that determine how the GPU transitions through power states.
Hi Alex. I do understand that; it is a part of power management. That is not what the bug is about. I did use pp_power_profile_mode too, and it doesn't really help.

The issue is that I would expect the performance to stabilize, and the frequency and voltages to converge to satisfy the load, but they don't. The workload this is happening with isn't GPU limited (it starts with a GPU load of about 40%), so it is not fully representative of other workloads, but the frequency transitions look suboptimal. Is it possible to set a custom SCLK hysteresis, maybe?

As I said, I tried setting:

echo "7" > /sys/class/drm/card0/device/pp_dpm_sclk
echo "manual" > /sys/class/drm/card0/device/power_dpm_force_performance_level

to confine sclk to a single level, but it didn't help. I tried changing pp_power_profile_mode to COMPUTE (just to see what happens), and there was no difference in the observed behaviour. I don't have amdgpu.ppfeaturemask set though, so maybe the kernel driver is simply ignoring my requests.
Did you 'cat pp_dpm_sclk' after 'echo 7 > ...'? I have to boot with amdgpu.runpm=0 to be able to change anything. The drawback is that the notebook fan goes off every 2-3 min, for 30 sec, at idle :/
You can force the clocks low or high by:

echo low > power_dpm_force_performance_level

or

echo high > power_dpm_force_performance_level

Setting it to auto will restore the automatic behavior:

echo auto > power_dpm_force_performance_level

The behavior will depend on the workload. If the workload is really bursty, it may cause the clocks to ramp up and down if there are sufficiently long idle periods between workloads. You can manually adjust the heuristics by selecting the custom profile and tweaking each parameter. See the documentation here: https://dri.freedesktop.org/docs/drm/gpu/amdgpu.html#gpu-power-thermal-controls-and-monitoring
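As a rough illustration of what "selecting the custom profile and tweaking each parameter" could look like, the sketch below parses the pp_power_profile_mode table (in the format shown in the dump earlier in this report) and builds the string for a custom profile. It rests on assumptions: that the file accepts the profile number followed by one value per heuristic column, as the linked documentation describes, and that CUSTOM is index 6 as in the dump. `parse_profiles` and `custom_profile_string` are hypothetical helpers, and none of this was verified on Fiji here.

```python
def parse_profiles(text):
    """Parse pp_power_profile_mode text into {index: profile dict}.

    Each profile dict has "name", "active" (True if marked with "*") and
    "values" (ints, or None where the table shows "-").
    """
    profiles = {}
    for line in text.splitlines():
        head, sep, tail = line.partition(":")
        parts = head.replace("*", " ").split()
        # Skip the header row and anything that does not start with an index.
        if not sep or not parts or not parts[0].isdigit():
            continue
        profiles[int(parts[0])] = {
            "name": " ".join(parts[1:]),
            "active": "*" in head,
            "values": [None if v == "-" else int(v) for v in tail.split()],
        }
    return profiles


def custom_profile_string(index, values):
    """Build "<index> <v1> <v2> ..." for writing to pp_power_profile_mode."""
    return " ".join(str(v) for v in (index, *values))
```

For example, starting from the 3D_FULL_SCREEN values and changing only SCLK_DOWN_HYST would give a string like "6 0 50 30 0 100 10"; the value 50 is purely illustrative. According to the same documentation, power_dpm_force_performance_level has to be set to manual before such a write.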
Setting echo high > power_dpm_force_performance_level didn't help. It is set to high (verified by reading back from sysfs). gpu_busy_percent is showing me 100 all the time, and sometimes jumps a little lower, maybe 87, 95, then back to 100. Despite all of this the GPU frequency is slowly and gradually dropping down, until it reaches the lowest frequency and then jumps back to maximum. It does look like a bug to me...
It sounds like the GPU is getting throttled due to power or temperature. Is the cooling solution on your GPU clean and working properly? Do you have an adequate power supply?
I also tried echo profile_peak > power_dpm_force_performance_level, and initially the sclk stays at the highest level, but after a minute it drops just like with the other profiles.

As for cooling and PSU, I am sure I have plenty of headroom. hwmon reports ~207W power at the highest clock state, but the temperature as reported by hwmon stays at 39 deg C. At lower clock states it drops to 37 deg C, and to about 35 deg C at idle. That looks just fine to me. My PC case does have extra ventilation, and the CPU is pretty much idle. My PSU is a Seasonic Prime Titanium (1000W).

I have:

Motherboard: MSI MEG Creation X399
CPU: AMD Threadripper 2950X at stock, liquid cooled, idle (~5% load), about 46 deg C.
RAM: 8x16GB Samsung DDR4 ECC
GPU: AMD Radeon Fury X, water cooled
NIC: Mellanox ConnectX-3
Storage: 2x Samsung PM983 via U.2
Cooling: Plenty of fans in Fractal Define R6 USB-C case.

I just opened the case and verified that the two 8-pin PCI-E connectors are connected to the GPU and PSU. I also verified that the extra (optional, even on multi-GPU systems) power connectors to the MB are also connected to the MB and PSU. The GPU radiator is above the GPU, and is not hot to the touch or anything. The fan on it is spinning at max.
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/936.