Bug 111482 - Sapphire Pulse RX 5700 XT power consumption
Summary: Sapphire Pulse RX 5700 XT power consumption
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu (show other bugs)
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
Reported: 2019-08-25 08:43 UTC by Robert
Modified: 2019-09-10 19:27 UTC (History)

Description Robert 2019-08-25 08:43:15 UTC
Hi!

I'm mainly referring to this thread in the Arch Linux forum: https://bbs.archlinux.org/viewtopic.php?id=247667

I have a Sapphire Pulse RX 5700 XT, and with the help of the thread above I managed to get it working. The card does not use the AMD reference design; it's one of the newer custom-design vendor cards.

I currently have installed this software stack:

- local/linux-amd-staging-drm-next-git 5.4.857545.b4d857ded1c5-1 (which is basically this one https://cgit.freedesktop.org/~agd5f/linux/tag/?h=drm-next-5.4-2019-08-23 as I modified PKGBUILD accordingly) 
- aur/llvm-minimal-git 10.0.0_r324774.c310e5a7ab6-1
- aur/mesa-git 19.2.0_devel.114565.b2839193987-1
- firmware 2019-08-21 from https://people.freedesktop.org/~agd5f/r … de/navi10/
- core/amd-ucode 20190815.07b925b-1

Everything regarding power consumption is perfect as long as I stay in the console. I also have Kernel Mode Setting (KMS) enabled. Running the "sensors" command, I currently get this output (it varies a little, of course, but stays around these values):

"""
Adapter: PCI adapter
vddgfx:       +0.72 V  
fan1:        168 RPM  (min =    0 RPM, max = 4950 RPM)
edge:         +42.0°C  (crit = +118.0°C, hyst =  +0.0°C)
                       (emerg = +80000.0°C)
junction:     +43.0°C  (crit = +80000.0°C, hyst =  +0.0°C)
                       (emerg = +80000.0°C)
mem:          +50.0°C  (crit = +80000.0°C, hyst =  +0.0°C)
                       (emerg = +80000.0°C)
power1:        8.00 W  (cap = 180.00 W)
"""

So according to this output the card uses 8 W in idle mode, which is what I expect (also no card fans are spinning, which is great). Now if I start KDE Plasma 5 with the OpenGL 3.1 backend this changes:

"""
amdgpu-pci-0c00
Adapter: PCI adapter
vddgfx:       +0.72 V  
fan1:         530 RPM  (min =    0 RPM, max = 3200 RPM)
edge:         +51.0°C  (crit = +118.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
junction:     +53.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
mem:          +62.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
power1:       32.00 W  (cap = 180.00 W)

asus-isa-0000
Adapter: ISA adapter
cpu_fan:        0 RPM
"""

Now I have a power consumption of around 32 W just from launching KDE Plasma. I didn't start anything else. According to my power meter it's even more than the 32 W that "sensors" reports (more like 40 W).
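
For comparing the console and desktop cases repeatedly, the power reading can be pulled out of the "sensors" output with a small helper. This is only a sketch: the function name gpu_power_watts is made up for illustration, and it assumes the "power1:" line format shown in the outputs above.

```shell
# Print the "power1" reading (in watts) from `sensors` output on stdin.
# gpu_power_watts is a hypothetical helper name, not part of lm-sensors.
gpu_power_watts() {
    # The value is the second whitespace-separated field on the power1 line.
    awk '$1 == "power1:" { print $2; exit }'
}
```

Usage would look like `sensors | gpu_power_watts`. When the reading is unavailable, the helper simply passes through whatever "sensors" prints in that field.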

Another user in the thread mentioned above reported that for him the power consumption stays at 8 W even when KDE is running. There is only one difference: he has a card with the AMD reference design and I have a custom-design card from Sapphire. So I can only suspect that there is some difference in the PowerPlay implementation. My card also has two different BIOSes and I tried both, but there is no difference regarding power consumption. The monitor is connected via DisplayPort, if that is of any interest. And here is my "dmesg" output (grepped for "amdgpu"):

[    1.266205] [drm] amdgpu kernel modesetting enabled.
[    1.266320] amdgpu 0000:0c:00.0: remove_conflicting_pci_framebuffers: bar 0: 0xe0000000 -> 0xefffffff
[    1.266320] amdgpu 0000:0c:00.0: remove_conflicting_pci_framebuffers: bar 2: 0xf0000000 -> 0xf01fffff
[    1.266321] amdgpu 0000:0c:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xf6c00000 -> 0xf6c7ffff
[    1.266322] fb0: switching to amdgpudrmfb from EFI VGA
[    1.266374] amdgpu 0000:0c:00.0: vgaarb: deactivate vga console
[    1.291100] amdgpu 0000:0c:00.0: No more image in the PCI ROM
[    1.291132] amdgpu 0000:0c:00.0: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[    1.291133] amdgpu 0000:0c:00.0: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    1.291191] [drm] amdgpu: 8176M of VRAM memory ready
[    1.291192] [drm] amdgpu: 8176M of GTT memory ready.
[    2.031317] amdgpu: [powerplay] SMU is initialized successfully!
[    2.196969] fbcon: amdgpudrmfb (fb0) is primary device
[    2.302436] amdgpu 0000:0c:00.0: fb0: amdgpudrmfb frame buffer device
[    2.316727] amdgpu 0000:0c:00.0: ring 0(gfx_0.0.0) uses VM inv eng 4 on hub 0
[    2.316728] amdgpu 0000:0c:00.0: ring 1(gfx_0.1.0) uses VM inv eng 5 on hub 0
[    2.316728] amdgpu 0000:0c:00.0: ring 2(comp_1.0.0) uses VM inv eng 6 on hub 0
[    2.316729] amdgpu 0000:0c:00.0: ring 3(comp_1.1.0) uses VM inv eng 7 on hub 0
[    2.316729] amdgpu 0000:0c:00.0: ring 4(comp_1.2.0) uses VM inv eng 8 on hub 0
[    2.316730] amdgpu 0000:0c:00.0: ring 5(comp_1.3.0) uses VM inv eng 9 on hub 0
[    2.316731] amdgpu 0000:0c:00.0: ring 6(comp_1.0.1) uses VM inv eng 10 on hub 0
[    2.316731] amdgpu 0000:0c:00.0: ring 7(comp_1.1.1) uses VM inv eng 11 on hub 0
[    2.316732] amdgpu 0000:0c:00.0: ring 8(comp_1.2.1) uses VM inv eng 12 on hub 0
[    2.316733] amdgpu 0000:0c:00.0: ring 9(comp_1.3.1) uses VM inv eng 13 on hub 0
[    2.316733] amdgpu 0000:0c:00.0: ring 10(kiq_2.1.0) uses VM inv eng 14 on hub 0
[    2.316734] amdgpu 0000:0c:00.0: ring 11(sdma0) uses VM inv eng 15 on hub 0
[    2.316735] amdgpu 0000:0c:00.0: ring 12(sdma1) uses VM inv eng 16 on hub 0
[    2.316735] amdgpu 0000:0c:00.0: ring 13(vcn_dec) uses VM inv eng 4 on hub 1
[    2.316736] amdgpu 0000:0c:00.0: ring 14(vcn_enc0) uses VM inv eng 5 on hub 1
[    2.316737] amdgpu 0000:0c:00.0: ring 15(vcn_enc1) uses VM inv eng 6 on hub 1
[    2.316737] amdgpu 0000:0c:00.0: ring 16(vcn_jpeg) uses VM inv eng 7 on hub 1
[    2.316923] [drm] Initialized amdgpu 3.34.0 20150101 for 0000:0c:00.0 on minor 0
[28830.279521] amdgpu 0000:0c:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem

So if someone can give me any hint regarding a possible solution or something to try out I would be very thankful :-) Power isn't that cheap in Germany ;-)
Comment 1 Robert 2019-08-26 23:07:36 UTC
Not sure if it's of any use but I figured out today that after starting KDE Plasma, launching "Konsole" and typing "sensors" the output is basically garbage:

"""
amdgpu-pci-0c00
Adapter: PCI adapter
vddgfx:       +0.72 V  
fan1:             N/A  (min =    0 RPM, max = 3200 RPM)
edge:             N/A  (crit = +118.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
junction:         N/A  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
mem:              N/A  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
power1:           N/A  (cap = 180.00 W)

asus-isa-0000
Adapter: ISA adapter
cpu_fan:        0 RPM
"""

I can repeat this a few times and it stays the same. And I always see these errors in "dmesg" or "journalctl":

"""
[  137.931148] amdgpu: [powerplay] failed send message: TransferTableSmu2Dram (18)      param: 0x00000006 response 0xffffffc2
[  137.931150] amdgpu: [powerplay] Failed to export SMU metrics table!
[  140.144885] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14)      param: 0x00000080 response 0xffffffc2
[  142.358346] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14)      param: 0x00000080 response 0xffffffc2
[  142.358348] amdgpu: [powerplay] Failed to export SMU metrics table!
[  144.571878] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14)      param: 0x00000080 response 0xffffffc2
[  146.785069] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14)      param: 0x00000080 response 0xffffffc2
[  146.785071] amdgpu: [powerplay] Failed to export SMU metrics table!
[  148.998450] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14)      param: 0x00000080 response 0xffffffc2
[  151.211737] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14)      param: 0x00000080 response 0xffffffc2
[  151.211738] amdgpu: [powerplay] Failed to export SMU metrics table!
[  153.425132] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14)      param: 0x00000080 response 0xffffffc2
[  155.638843] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14)      param: 0x00000080 response 0xffffffc2
[  155.638845] amdgpu: [powerplay] Failed to export SMU metrics table!
"""

It looks like "sensors" triggers one such "failed send message..." error for every value it tries to read.
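
To check that impression, the failures can be counted per SMU message name. The following is a rough sketch that assumes the log line format shown above; count_smu_failures is a hypothetical helper name.

```shell
# Count "[powerplay] failed send message" lines per SMU message name.
# Reads kernel log text on stdin and prints "<count> <message>" per name.
# count_smu_failures is a hypothetical helper name.
count_smu_failures() {
    # Keep only the message name that follows "failed send message: ".
    sed -n 's/.*\[powerplay\] failed send message: \([A-Za-z0-9_]*\).*/\1/p' |
        sort | uniq -c | awk '{ print $1, $2 }'
}
```

It could be fed with `dmesg | count_smu_failures` (reading the kernel log may need root) before and after a single "sensors" run, to compare the counts.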

Now the funny thing is that if I start "Firefox" the screen "flickers" very briefly and afterwards "sensors" prints useful values, e.g.:

"""
amdgpu-pci-0c00
Adapter: PCI adapter
vddgfx:       +0.72 V  
fan1:         531 RPM  (min =    0 RPM, max = 3200 RPM)
edge:         +54.0°C  (crit = +118.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
junction:     +56.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
mem:          +66.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
power1:       34.00 W  (cap = 180.00 W)

asus-isa-0000
Adapter: ISA adapter
cpu_fan:        0 RPM
"""

But the problem with high idle power consumption of course doesn't change. Today I updated to the latest firmware from 2019-08-26 and also updated Mesa to 19.2-rc1. In the last post I forgot to mention that I'm currently using "libdrm-git 2.4.99.r16.g14922551-1" which is basically libdrm master branch AFAIK.

I'm really a little bit out of ideas ATM. Besides the idle power consumption thingy everything is working perfectly. Even Minecraft ;-)

Before I installed Arch Linux from scratch I used an Nvidia GTX 1060 with the Nvidia binary drivers in the same host, as the Sapphire card I now use hadn't been released at that time. With that card I didn't have any issues with idle power consumption. It was around 8-10W while running KDE Plasma.
Comment 2 Andrew Sheldon 2019-08-27 04:47:44 UTC
I have the same problem, but with the MSI Evoke 5700 XT. If you read /sys/class/drm/card0/device/pp_dpm_mclk you should find that it's forced to the highest state (3: 875Mhz) and that although it lets you set a lower value, it immediately jumps back to the maximum value.
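
As an illustration of how to read that file: the active state is the line marked with "*". The helper below is a sketch with a made-up name (active_dpm_state), and it assumes the "index: frequency *" layout described here.

```shell
# Print the index and frequency of the active (starred) DPM state.
# Reads the contents of a pp_dpm_* file on stdin.
# active_dpm_state is a hypothetical helper name.
active_dpm_state() {
    # Strip the colon from the index field, then print index and frequency.
    awk '/\*/ { sub(":", "", $1); print $1, $2; exit }'
}
```

Usage would be `active_dpm_state < /sys/class/drm/card0/device/pp_dpm_mclk`; the card0 path is the one from this comment and may differ on other systems.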

In theory, this problem should have been fixed with b90053edc9d6d639ddb600f8799d990d92aca328 in amd-staging-drm-next:
drm/amd/display: Support uclk switching for DCN2

but it doesn't seem to fix the problem for me. Before this, you could revert the old workaround:
02316e963a5a drm/amd/display: Force uclk to max for every state

and you could manually set mclk.

I should note that from some brief tests on Windows, the card also seems to be stuck at maximum mclk (it's actually even worse since temperature readings don't even work there). So it could be that aftermarket cards need some extra work in order to work properly.

System:
Mesa git
amd-staging-drm-next (also tested 5.3-rcX and drm-next-5.4)
Comment 3 Robert 2019-08-27 07:25:02 UTC
Thanks Andrew for your comment! At least now I know that I'm not alone ;-) The funny thing is that one of the users in the Arch Linux forum thread (https://bbs.archlinux.org/viewtopic.php?pid=1860353#p1860353) mentions that for him it is working with Gnome 3 under Wayland. He also has a Sapphire Pulse RX 5700 XT.

Maybe it's a combination of chipset + graphics card? I have an Asus ROG STRIX X570-E GAMING board, so it has an X570 chipset.
Comment 4 Robert 2019-08-27 22:20:35 UTC
I guess it's also not of interest, but if I pull the DisplayPort cable and plug it in again I get this error via "dmesg" (this happens every time I do this):

"""
[Wed Aug 28 00:12:08 2019] ------------[ cut here ]------------
[Wed Aug 28 00:12:08 2019] WARNING: CPU: 6 PID: 1995 at drivers/gpu/drm/amd/amdgpu/../display/dc/dcn20/dcn20_resource.c:2800 dcn20_validate_bandwidth.cold+0xe/0x18 [amdgpu]
[Wed Aug 28 00:12:08 2019] Modules linked in: xt_nat veth xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat br_netfilter overlay wireguard(O) ip6_udp_tunnel udp_tunnel edac_mce_amd kvm_amd sr_mod cdrom uas usb_storage stv6110x lnbp21 kvm btusb btrtl snd_hda_codec_realtek btbcm btintel nls_iso8859_1 snd_hda_codec_generic nls_cp437 ledtrig_audio snd_hda_codec_hdmi vfat fat stv090x bluetooth snd_hda_intel crct10dif_pclmul crc32_pclmul snd_hda_codec ghash_clmulni_intel bridge snd_hda_core eeepc_wmi asus_wmi snd_hwdep sparse_keymap ngene stp ecdh_generic snd_pcm dvb_core aesni_intel ecc videobuf2_vmalloc joydev snd_timer ccp videobuf2_memops llc video wmi_bmof mxm_wmi videobuf2_common sp5100_tco mousedev evdev aes_x86_64 input_leds snd crypto_simd led_class mac_hid cryptd glue_helper i2c_piix4 rfkill pcspkr rng_core soundcore videodev igb mc dca wmi button acpi_cpufreq nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt nf_log_ipv4 nf_log_common ipt_REJECT nf_reject_ipv4 xt_LOG
[Wed Aug 28 00:12:08 2019]  xt_multiport xt_limit xt_addrtype xt_tcpudp xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nfsd nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter bpfilter auth_rpcgss nfs_acl lockd grace sch_fq_codel sunrpc sg crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 hid_logitech_hidpp sd_mod hid_logitech_dj hid_generic usbhid hid crc32c_intel ahci xhci_pci libahci xhci_hcd libata usbcore nvme scsi_mod usb_common nvme_core amdgpu amd_iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm agpgart vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio
[Wed Aug 28 00:12:08 2019] CPU: 6 PID: 1995 Comm: Xorg Tainted: G        W  O      5.3.0-rc3-amd-staging-drm-next-git-b8cd95e15410 #1
[Wed Aug 28 00:12:08 2019] Hardware name: System manufacturer System Product Name/ROG STRIX X570-E GAMING, BIOS 1005 08/01/2019
[Wed Aug 28 00:12:08 2019] RIP: 0010:dcn20_validate_bandwidth.cold+0xe/0x18 [amdgpu]
[Wed Aug 28 00:12:08 2019] Code: d9 05 ef e0 18 00 8b 54 24 08 0f b7 44 24 2e 80 cc 0c 66 89 44 24 2c e9 83 ed f4 ff 48 c7 c7 50 49 80 c0 31 c0 e8 dd 8b 9b cb <0f> 0b 45 89 f5 e9 5e f3 f4 ff 48 c7 c7 50 49 80 c0 31 c0 e8 c5 8b
[Wed Aug 28 00:12:08 2019] RSP: 0018:ffff9e5e09b43a98 EFLAGS: 00010246
[Wed Aug 28 00:12:08 2019] RAX: 0000000000000024 RBX: 4079400000000000 RCX: 0000000000000000
[Wed Aug 28 00:12:08 2019] RDX: 0000000000000000 RSI: ffff8e847e997448 RDI: ffff8e847e997448
[Wed Aug 28 00:12:08 2019] RBP: ffff8e8337650000 R08: ffff8e847e997448 R09: 0000000000000004
[Wed Aug 28 00:12:08 2019] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8e8470d00000
[Wed Aug 28 00:12:08 2019] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8e84708e6000
[Wed Aug 28 00:12:08 2019] FS:  00007f1c3fdbfdc0(0000) GS:ffff8e847e980000(0000) knlGS:0000000000000000
[Wed Aug 28 00:12:08 2019] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Aug 28 00:12:08 2019] CR2: 00007f4398fee528 CR3: 0000000ff4c52000 CR4: 0000000000340ee0
[Wed Aug 28 00:12:08 2019] Call Trace:
[Wed Aug 28 00:12:08 2019]  dc_validate_global_state+0x28a/0x310 [amdgpu]
[Wed Aug 28 00:12:08 2019]  amdgpu_dm_atomic_check+0x5a2/0x800 [amdgpu]
[Wed Aug 28 00:12:08 2019]  drm_atomic_check_only+0x550/0x780 [drm]
[Wed Aug 28 00:12:08 2019]  drm_atomic_commit+0x13/0x50 [drm]
[Wed Aug 28 00:12:08 2019]  drm_atomic_connector_commit_dpms+0xd7/0x100 [drm]
[Wed Aug 28 00:12:08 2019]  drm_mode_obj_set_property_ioctl+0x159/0x2b0 [drm]
[Wed Aug 28 00:12:08 2019]  ? drm_connector_set_obj_prop+0x90/0x90 [drm]
[Wed Aug 28 00:12:08 2019]  drm_connector_property_set_ioctl+0x39/0x60 [drm]
[Wed Aug 28 00:12:08 2019]  drm_ioctl_kernel+0xaa/0xf0 [drm]
[Wed Aug 28 00:12:08 2019]  drm_ioctl+0x208/0x390 [drm]
[Wed Aug 28 00:12:08 2019]  ? drm_connector_set_obj_prop+0x90/0x90 [drm]
[Wed Aug 28 00:12:08 2019]  ? ep_read_events_proc+0xd0/0xd0
[Wed Aug 28 00:12:08 2019]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[Wed Aug 28 00:12:08 2019]  do_vfs_ioctl+0x40c/0x670
[Wed Aug 28 00:12:08 2019]  ksys_ioctl+0x5e/0x90
[Wed Aug 28 00:12:08 2019]  __x64_sys_ioctl+0x16/0x20
[Wed Aug 28 00:12:08 2019]  do_syscall_64+0x4e/0x120
[Wed Aug 28 00:12:08 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Wed Aug 28 00:12:08 2019] RIP: 0033:0x7f1c411f221b
[Wed Aug 28 00:12:08 2019] Code: 0f 1e fa 48 8b 05 75 8c 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 45 8c 0c 00 f7 d8 64 89 01 48
[Wed Aug 28 00:12:08 2019] RSP: 002b:00007ffe5a751d68 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Wed Aug 28 00:12:08 2019] RAX: ffffffffffffffda RBX: 00007ffe5a751da0 RCX: 00007f1c411f221b
[Wed Aug 28 00:12:08 2019] RDX: 00007ffe5a751da0 RSI: 00000000c01064ab RDI: 000000000000000d
[Wed Aug 28 00:12:08 2019] RBP: 00000000c01064ab R08: 0000000000000000 R09: 000056432099ffb0
[Wed Aug 28 00:12:08 2019] R10: 0000000000000000 R11: 0000000000000246 R12: 000056431ed03f90
[Wed Aug 28 00:12:08 2019] R13: 000000000000000d R14: 00005643226d4d70 R15: 0000000000000000
[Wed Aug 28 00:12:08 2019] ---[ end trace 7f6319103d8b887e ]---
"""

Besides the error nothing else happens. Display is still working fine afterwards (besides the still high power consumption).

As Andrew already mentioned my card also stays at the highest frequency:

"""
cat /sys/class/drm/card0/device/pp_dpm_mclk
0: 100Mhz 
1: 500Mhz 
2: 625Mhz 
3: 875Mhz *
"""
Comment 5 Robert 2019-08-28 14:25:10 UTC
One additional observation I made yesterday: If I stop sddm/KDE via "systemctl stop sddm" the frequency has changed after I'm back in console:

"""
cat /sys/class/drm/card0/device/pp_dpm_mclk
0: 100Mhz 
1: 500Mhz *
2: 625Mhz 
3: 875Mhz 
"""

At this point I'm even able to go down to "100MHz", and power consumption goes down, at least to 13W. If I run "systemctl start sddm" again I only see the KDE Plasma start logo and that's it. Going back to the console and stopping sddm again, the frequency is again at "875MHz" and can't be changed anymore. In this case you have to reboot.

But I guess without some developer guidance/hints there isn't much I can do anymore.
Comment 6 Andrew Sheldon 2019-09-03 09:35:41 UTC
Okay, so in my case, it turned out to be a problem with >60hz refresh rates. If I set to 60hz, the problem goes away.

sensors:
amdgpu-pci-0d00
Adapter: PCI adapter
vddgfx:       +0.72 V
fan1:           0 RPM  (min =    0 RPM, max = 3200 RPM)
edge:         +53.0°C  (crit = +118.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
junction:     +53.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
mem:          +56.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
power1:       12.00 W  (cap = 200.00 W)

cat pp_dpm_mclk:

0: 100Mhz *
1: 500Mhz 
2: 625Mhz 
3: 875Mhz

This is a problem on Windows as well, so there looks to be a cross-platform bug here.

Also, much like Windows, 75hz is even more buggy, with lm-sensors triggering the weirdness relating to sensor data that some users have reported (N/A sensor readings, and then a lockup). Windows has a variation of this, with all sensors being unreadable when using a 75hz refresh rate (but no lockup at least).

My main refresh rate (92hz) doesn't have the latter problem, at least.
Comment 7 Robert 2019-09-03 16:54:39 UTC
Thanks Andrew for your comment! Sadly that doesn't apply to me. My 49WL95C-W runs at 5120x1440 @ 60 Hz, connected via DisplayPort, and the display doesn't support a refresh rate higher than 60 Hz.
Comment 8 Andrew Sheldon 2019-09-04 01:16:23 UTC
I just did some more tests, and in my case, it wasn't strictly the refresh rate, but the timings being too aggressive (which I needed to do to lower the pixel clock enough due to driver limits, which are quite conservative).

This wasn't as much of a problem with Vega, since the idle power usage was about the same (12-15W), but it is with Navi.

I will also add that during my tests, I found it was possible to leave the system in a state where I couldn't leave the high memclock/power usage situation after a while, even when switching to 60hz, requiring a reboot. So that might be what is happening on your system, Robert.
Comment 9 Robert 2019-09-04 22:16:57 UTC
Thanks Andrew, but I guess I don't know how to interpret your last comment ;-) Is there something I can test/change? I can't change the value of "/sys/class/drm/card0/device/pp_dpm_mclk", which is the memclock frequency AFAIK. It always stays at 875Mhz regardless of which value I write. As soon as I start KDE Plasma and open "konsole" I see the 875Mhz. So the power usage is already high at this point, not only after using KDE Plasma for a while. Executing

echo "2" > /sys/class/drm/card0/device/pp_dpm_mclk

e.g. doesn't change the frequency.
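
One thing worth ruling out: the kernel's amdgpu documentation describes the pp_dpm_* files as manually adjustable only after "power_dpm_force_performance_level" has been set to "manual". The thread does not say whether that was done here, and it may well make no difference if the display code is pinning the memory clock. A minimal sketch, with set_mclk_state as a hypothetical helper name:

```shell
# Usage: set_mclk_state <device-dir> <state-index>
# set_mclk_state is a hypothetical helper name; run as root on real hardware.
set_mclk_state() {
    local dev="$1" idx="$2"
    # Select manual control first, then request the memory clock state.
    echo manual > "$dev/power_dpm_force_performance_level" || return 1
    echo "$idx" > "$dev/pp_dpm_mclk"
}
```

For example, `set_mclk_state /sys/class/drm/card0/device 2`, followed by reading pp_dpm_mclk again to see whether the "*" marker moved.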

Maybe it's a "KDE Plasma + X server" thingy. Users who use "KDE Plasma + Wayland" seem to be affected less frequently by the problem. But that's not really an option for me, as I need screen sharing from time to time via Zoom video conferencing or Slack, and AFAIK that still doesn't work with Wayland (besides other problems). It's somehow funny that Wayland seems to be less of an issue than X...

Maybe it's an X570 chipset + 5700XT thingy. I've no idea. I'm not a driver developer ;-) I can only try things out like settings or patches (if I get some). But I guess this is one of the issues that some future commit will maybe fix "by accident", or it will just stay there forever ;-) Seeing how long other AMD-related threads stay around in this bug tracker without a solution, or only get one after years, I currently don't have much hope that there will be any solution for my problem. Hopefully Intel launches a dedicated graphics card sometime next year. I never had notable issues with Intel hardware and Linux within the last few years. It just works ;-)
Comment 10 Andrew Sheldon 2019-09-05 01:01:55 UTC
>I don't know how to interpret your last comment 

Yeah, I was a bit unclear. I was just indicating that while I can work around the issue, it can still be triggered on my system as well. E.g. if I switch to 75hz, it will be stuck at 850mhz (even after switching back), so it's possible that the issue can be triggered in different ways (but the underlying issue may be the same).

Anyway, I suspect that this bug and the one related to sensor readings (including the 75hz issue) are related. It's most likely a video BIOS/firmware issue, as it affects Windows as well, and some have even triggered the bug in BIOS settings, with monitors that use 75hz.

One thing you could try is booting with a window manager/DE that doesn't use any sort of hardware acceleration. That's the main difference I can figure between my system, and yours (besides the fact I use x370 instead of x570). I would also try a lower resolution just to test, as that's a pretty non-standard res, and might be another way of triggering this bug.
Comment 11 Andrew Sheldon 2019-09-05 02:20:05 UTC
One more thing to add: some users on Windows have had issues with X470/X570 PCIE4 support. The problem being that Navi advertises PCIE4 support, but doesn't actually support it properly yet, causing weird issues that potentially could result in your issue. If your BIOS has the option, try changing PCIE from Auto/4 to 3.
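
Independent of the BIOS setting, the negotiated link speed can be read from the "current_link_speed" attribute of the PCI device in sysfs. The sketch below maps such a string to a PCIe generation (2.5, 5, 8 and 16 GT/s correspond to generations 1 to 4). The helper name pcie_gen is made up, and the exact string format is assumed to vary between kernels, so only the leading number is parsed.

```shell
# Map a link-speed string on stdin (e.g. "8.0 GT/s PCIe") to a PCIe generation.
# pcie_gen is a hypothetical helper name.
pcie_gen() {
    awk '{
        s = $1 + 0                 # leading number, in GT/s
        if      (s == 2.5) print 1
        else if (s == 5)   print 2
        else if (s == 8)   print 3
        else if (s == 16)  print 4
        else               print "unknown"
        exit
    }'
}
```

With the device address from the dmesg output in the description, that would be `pcie_gen < /sys/bus/pci/devices/0000:0c:00.0/current_link_speed`. The value may change with load, so it is probably worth reading it more than once.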
Comment 12 Robert 2019-09-05 14:21:54 UTC
Andrew, you're my hero ;-) While I'm even more sad now (because I now think that this issue will indeed never be fixed), I can now at least imagine what's going on.

As you recommended I changed resolution to 1920x1080. That's quite common I would say. And tata! Indeed "sensors" reported 8W. Then I changed to 3840x2160. Not so usual I guess but the 2nd biggest resolution my monitor supports. Still 8W! And then back to 5120x1440. And tata! 33W :-(

That could mean two things: 1) There is a bug somewhere (firmware, driver, ...) with this resolution which causes the high idle power consumption, or 2) which is even worse, I suspect that with this resolution the 5700XT acts as if you had plugged in two monitors. And from what I read throughout all the Navi10 reviews, multi-monitor setups and power consumption were and still are a problem with AMD graphics cards in general. Oh man, that's something I really hadn't reckoned with :-( Yeah, in that case I can really only hope for the Intel Xe graphics cards next year (if they really build a consumer card, which nobody knows yet ;-) ).

I still don't understand why 8W is enough in the console, where KMS is enabled, while running KDE Plasma at that resolution takes roughly 33W. But maybe it really has something to do with acceleration. It doesn't even consider reducing the memory clock a little bit, even if I do nothing and haven't started any program. That's the funny thing in general: as long as I don't start Firefox, Thunderbird or something like that, "sensors" doesn't even work. It just prints garbage values and takes minutes to complete. If I start one of the programs mentioned above the screen flickers very briefly and afterwards "sensors" works as expected. I guess only the AMD god knows what that means ;-)

At least I now know that I can't do anything further. It just would be cool if one of the AMD engineers could confirm my assumption that with a resolution of 5120x1440 the card always runs at the highest memory clock speed, as it does with a multi-monitor setup (from what I've read so far).
Comment 13 Robert 2019-09-05 14:24:41 UTC
Ah, and regarding the PCIe3/4 thingy: I can't change that in the BIOS. I didn't find any setting that allows me to change it, either globally or for the PCIe slot in question. But I guess that's something I don't really need to try anymore anyway.
Comment 14 Ilia Mirkin 2019-09-05 14:53:50 UTC
(In reply to Robert from comment #12)
> At least I now know that I can't do anything further. It just would be cool
> if one of the AMD engineers could confirm my assumption that with a
> resolution of 5120x1440 the card always runs at highest memory clock speed
> as it does with a multi monitor setup (from what I've read so far).

[Note, I'm not an AMD engineer.]

In some monitors, such high modes are actually exposed by presenting multiple "tiles" as separate screens. As far as the GPU is concerned, it's 2 actual monitors (this can only work with DisplayPort, of course).

Can you check if this is the case? I believe "xrandr" should report 2 separate monitors in this situation. You mention you have a 49WL95C-W, which the internet suggests is indeed just 2 panels placed next to each other in a nice plastic case.

And in such cases, I believe the AMD drivers clock to the highest rate, since reclocks will cause flickering (since the vsync's of the 2 monitors aren't sync'd to one another).
Comment 15 Robert 2019-09-05 18:16:59 UTC
Thanks Ilia for your comment! I get this output from "xrandr":

"""
Screen 0: minimum 320 x 200, current 5120 x 1440, maximum 16384 x 16384
DisplayPort-0 disconnected (normal left inverted right x axis y axis)
DisplayPort-1 disconnected (normal left inverted right x axis y axis)
DisplayPort-2 connected primary 5120x1440+0+0 (normal left inverted right x axis y axis) 1200mm x 340mm
   5120x1440     60.00 +  30.00*+
   3840x1080     60.00 +
   3840x2160     60.00    30.00  
   1920x1200     60.00  
   1920x1080     60.00    59.94  
   1600x1200     60.00  
   1680x1050     60.00  
   1600x900      60.00  
   1280x1024     60.02  
   1440x900      60.00  
   1280x800      59.81  
   1152x864      59.97  
   1280x720      60.00    59.94  
   1024x768      60.00  
   800x600       60.32  
   720x480       60.00    59.94  
   640x480       60.00    59.94  
HDMI-A-0 disconnected (normal left inverted right x axis y axis)
"""

So from what I can see only one monitor is reported.
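
For reference, the same check can be done mechanically by counting the outputs that xrandr lists as "connected"; going by Ilia's comment, a tiled display would be expected to show up as more than one. This is a sketch only, and connected_outputs is a hypothetical helper name.

```shell
# Count outputs that xrandr reports as connected.
# Reads plain `xrandr` output on stdin; "disconnected" lines are not counted.
# connected_outputs is a hypothetical helper name.
connected_outputs() {
    awk '$2 == "connected" { n++ } END { print n + 0 }'
}
```

Running `xrandr | connected_outputs` against the output above gives 1.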

But I figured out something else: If I change the refresh rate from 60Hz to 30Hz I get 8W idle power consumption... Umpf... Now I've a big screen, kinda high end graphics card and 30Hz refresh rate :D It basically works but moving windows a little bit faster or moving the mouse pointer around looks "interesting". Haven't tested any games with that refresh rate but I guess it also looks "interesting" ;-)
Comment 16 Andrew Sheldon 2019-09-08 02:02:18 UTC
One possibility could be to create a custom modeline, perhaps trying refresh rates between 30-60hz (starting with 45hz), so you can find a point where the high idle power usage kicks in. Reduced blanking modes could be useful if it's a case of bandwidth.

See: https://github.com/kevinlekiller/cvt_modeline_calculator_12

Something like this, using just xrandr (-b option indicating reduced blanking v2 mode):

./cvt12 5120 1440 45 -b

Which yields:
Modeline "5120x1440_45.00_rb2"  344.21  5120 5128 5160 5200  1440 1457 1465 1471 +hsync -vsync

Then:
xrandr --output DisplayPort-2 --newmode "5120x1440_45.00_rb2" 344.21  5120 5128 5160 5200  1440 1457 1465 1471 +hsync -vsync

xrandr --output DisplayPort-2 --addmode DisplayPort-2 "5120x1440_45.00_rb2"

xrandr --output DisplayPort-2 --mode "5120x1440_45.00_rb2"
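
The pixel clock in such a modeline is the total horizontal pixels times the total vertical lines times the refresh rate. For the modeline above the totals are 5200 and 1471, so 5200 x 1471 x 45 Hz = 344,214,000 Hz, i.e. the 344.21 MHz that cvt12 printed. A small sketch for trying other refresh rates (pixel_clock_mhz is a hypothetical helper name, and the totals have to be taken from the matching cvt12 output):

```shell
# Usage: pixel_clock_mhz <htotal> <vtotal> <refresh-hz>
# Prints the pixel clock in MHz, rounded to two decimals.
# pixel_clock_mhz is a hypothetical helper name.
pixel_clock_mhz() {
    awk -v h="$1" -v v="$2" -v r="$3" 'BEGIN { printf "%.2f\n", h * v * r / 1e6 }'
}
```

`pixel_clock_mhz 5200 1471 45` prints 344.21, matching the modeline.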
Comment 17 Robert 2019-09-10 19:27:43 UTC
Thanks Andrew! I played around a little bit with the refresh rates. Between 40-60Hz there is no difference in idle power consumption. The mem clock stays at 875Mhz and can't be changed.

The best refresh rate with 8W idle power consumption I could get was at 39Hz:

cvt12 5120 1440 39 -b
xrandr --output DisplayPort-2 --newmode "5120x1440_39.00_rb2" 297.51  5120 5128 5160 5200  1440 1453 1461 1467 +hsync -vsync
xrandr --output DisplayPort-2 --addmode DisplayPort-2 "5120x1440_39.00_rb2"
xrandr --output DisplayPort-2 --mode "5120x1440_39.00_rb2"

This causes the mem clock to go up to 625Mhz at first but it can be switched back to 100Mhz with

echo "0" > /sys/class/drm/card0/device/pp_dpm_mclk

Regarding my statement when using 30Hz in the last comment:

"""
It basically works but moving windows a little bit faster or moving the mouse pointer around looks "interesting".
"""

For this "flickering" that I saw and which was quite annoying I found a workaround :-) It looked like something didn't refresh fast enough. So I thought playing around with some frequencies would be a good idea... And the mem clock was the obvious one to start with. So I was setting the mem clock to 500Mhz with

echo "1" > /sys/class/drm/card0/device/pp_dpm_mclk

Then the "flickering" went away :-) But of course that brought idle power consumption to 24W. So just for fun I switched back to 100Mhz with

echo "0" > /sys/class/drm/card0/device/pp_dpm_mclk

Funny enough the "flickering" stayed away :-))) So for now after I start KDE plasma I enter Konsole and execute

echo "1" > /sys/class/drm/card0/device/pp_dpm_mclk
echo "0" > /sys/class/drm/card0/device/pp_dpm_mclk

and be happy :D
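
The two writes can be wrapped in a small helper so they can be run from a login script. This is just a sketch of the sequence above: replay_mclk_states is a made-up name, and the one-second pause between writes is an assumption, not something tested here.

```shell
# Usage: replay_mclk_states <pp_dpm_mclk-path> <state>...
# Writes each state index to the file in order; run as root on real hardware.
# replay_mclk_states is a hypothetical helper name.
replay_mclk_states() {
    local file="$1" state
    shift
    for state in "$@"; do
        echo "$state" > "$file" || return 1
        sleep 1   # assumed settle time between writes
    done
}
```

The workaround then becomes `replay_mclk_states /sys/class/drm/card0/device/pp_dpm_mclk 1 0`.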

One final observation: I tried out kernel 5.3-rc8. With that kernel there is no way to reduce idle power consumption. It stays at 34W regardless of what you do. But with this tag https://cgit.freedesktop.org/~agd5f/linux/tag/?h=drm-next-5.4-2019-08-30 (which basically is kernel 5.3-rc3 with the Navi10 patches for kernel 5.4 - if I got it right ;-) ) idle power consumption is as expected.

So my whole issue basically comes down to this: If you have a resolution of 5120x1440 and a refresh rate of > 39Hz your idle power consumption stays at max and there is (at least until now) nothing you can do about it. So if I had used a lower resolution or a smaller screen I wouldn't have had an issue at all ;-) S... happens :D

But anyways: Thanks so much for your help and also to Ilia! I'm now happy with my setup so far. It would be very interesting to know if there is really some kind of a cap at 5120x1440@39Hz or if this is "only" a firmware problem, a driver problem, a config error or something completely different. Maybe we'll find out in our next lives :D

