Bug 111482

Summary:	Sapphire Pulse RX 5700 XT power consumption
Product:	DRI	Reporter:	Robert <freedesktop>
Component:	DRM/AMDgpu	Assignee:	Default DRI bug account <dri-devel>
Status:	RESOLVED MOVED	QA Contact:
Severity:	normal
Priority:	medium	CC:	asheldon55, danielkinsman.nospam, mateusz+freedesktop, popovic.marko, shtetldik
Version:	DRI git
Hardware:	x86-64 (AMD64)
OS:	Linux (All)
Whiteboard:
i915 platform:		i915 features:

Description Robert 2019-08-25 08:43:15 UTC

Hi!

I'm mainly referring to this thread in Archlinux forum: https://bbs.archlinux.org/viewtopic.php?id=247667

I have a Sapphire Pulse RX 5700 XT and with the help of the thread above I managed to get it working. The card is not using the AMD reference implementation so it's one of the newer vendor custom design cards.

I currently have installed this software stack:

- local/linux-amd-staging-drm-next-git 5.4.857545.b4d857ded1c5-1 (which is basically this one https://cgit.freedesktop.org/~agd5f/linux/tag/?h=drm-next-5.4-2019-08-23 as I modified PKGBUILD accordingly) 
- aur/llvm-minimal-git 10.0.0_r324774.c310e5a7ab6-1
- aur/mesa-git 19.2.0_devel.114565.b2839193987-1
- firmware 2019-08-21 from https://people.freedesktop.org/~agd5f/r … de/navi10/
- core/amd-ucode 20190815.07b925b-1

Everything regarding power consumption is perfect as long as I stay in console. I also have Kernel Mode Setting (KMS) enabled. Executing the "sensors" command I get this output ATM (it varies a little bit of course but basically stays around these values):

"""
Adapter: PCI adapter
vddgfx:       +0.72 V  
fan1:        168 RPM  (min =    0 RPM, max = 4950 RPM)
edge:         +42.0°C  (crit = +118.0°C, hyst =  +0.0°C)
                       (emerg = +80000.0°C)
junction:     +43.0°C  (crit = +80000.0°C, hyst =  +0.0°C)
                       (emerg = +80000.0°C)
mem:          +50.0°C  (crit = +80000.0°C, hyst =  +0.0°C)
                       (emerg = +80000.0°C)
power1:        8.00 W  (cap = 180.00 W)
"""

So according to this output the card uses 8 W in idle mode which is what I'm expecting (also no card fans are spinning which is great). Now if I start KDE Plasma 5 with OpenGL 3.1 backend this changes:

"""
amdgpu-pci-0c00
Adapter: PCI adapter
vddgfx:       +0.72 V  
fan1:         530 RPM  (min =    0 RPM, max = 3200 RPM)
edge:         +51.0°C  (crit = +118.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
junction:     +53.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
mem:          +62.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
power1:       32.00 W  (cap = 180.00 W)

asus-isa-0000
Adapter: ISA adapter
cpu_fan:        0 RPM
"""

Now I have a power consumption of around 32 W just by launching KDE Plasma. I didn't start anything else. Having a look at my power meter, it's even more than the 32 W that "sensors" is reporting (more like 40 W).

Another user in the thread mentioned above reported that for him the power consumption stays at 8 W even when KDE is running. There is only one difference: He has a card with the AMD reference design and I've a custom design card from Sapphire. So I can only suspect that there is some difference in the power play implementation. My card also has two different BIOSes and I tried both but there is no difference regarding power consumption. The monitor is connected via DisplayPort if this is of any interest. And here is my "dmesg" output (grepped for "amdgpu"):

[    1.266205] [drm] amdgpu kernel modesetting enabled.
[    1.266320] amdgpu 0000:0c:00.0: remove_conflicting_pci_framebuffers: bar 0: 0xe0000000 -> 0xefffffff
[    1.266320] amdgpu 0000:0c:00.0: remove_conflicting_pci_framebuffers: bar 2: 0xf0000000 -> 0xf01fffff
[    1.266321] amdgpu 0000:0c:00.0: remove_conflicting_pci_framebuffers: bar 5: 0xf6c00000 -> 0xf6c7ffff
[    1.266322] fb0: switching to amdgpudrmfb from EFI VGA
[    1.266374] amdgpu 0000:0c:00.0: vgaarb: deactivate vga console
[    1.291100] amdgpu 0000:0c:00.0: No more image in the PCI ROM
[    1.291132] amdgpu 0000:0c:00.0: VRAM: 8176M 0x0000008000000000 - 0x00000081FEFFFFFF (8176M used)
[    1.291133] amdgpu 0000:0c:00.0: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
[    1.291191] [drm] amdgpu: 8176M of VRAM memory ready
[    1.291192] [drm] amdgpu: 8176M of GTT memory ready.
[    2.031317] amdgpu: [powerplay] SMU is initialized successfully!
[    2.196969] fbcon: amdgpudrmfb (fb0) is primary device
[    2.302436] amdgpu 0000:0c:00.0: fb0: amdgpudrmfb frame buffer device
[    2.316727] amdgpu 0000:0c:00.0: ring 0(gfx_0.0.0) uses VM inv eng 4 on hub 0
[    2.316728] amdgpu 0000:0c:00.0: ring 1(gfx_0.1.0) uses VM inv eng 5 on hub 0
[    2.316728] amdgpu 0000:0c:00.0: ring 2(comp_1.0.0) uses VM inv eng 6 on hub 0
[    2.316729] amdgpu 0000:0c:00.0: ring 3(comp_1.1.0) uses VM inv eng 7 on hub 0
[    2.316729] amdgpu 0000:0c:00.0: ring 4(comp_1.2.0) uses VM inv eng 8 on hub 0
[    2.316730] amdgpu 0000:0c:00.0: ring 5(comp_1.3.0) uses VM inv eng 9 on hub 0
[    2.316731] amdgpu 0000:0c:00.0: ring 6(comp_1.0.1) uses VM inv eng 10 on hub 0
[    2.316731] amdgpu 0000:0c:00.0: ring 7(comp_1.1.1) uses VM inv eng 11 on hub 0
[    2.316732] amdgpu 0000:0c:00.0: ring 8(comp_1.2.1) uses VM inv eng 12 on hub 0
[    2.316733] amdgpu 0000:0c:00.0: ring 9(comp_1.3.1) uses VM inv eng 13 on hub 0
[    2.316733] amdgpu 0000:0c:00.0: ring 10(kiq_2.1.0) uses VM inv eng 14 on hub 0
[    2.316734] amdgpu 0000:0c:00.0: ring 11(sdma0) uses VM inv eng 15 on hub 0
[    2.316735] amdgpu 0000:0c:00.0: ring 12(sdma1) uses VM inv eng 16 on hub 0
[    2.316735] amdgpu 0000:0c:00.0: ring 13(vcn_dec) uses VM inv eng 4 on hub 1
[    2.316736] amdgpu 0000:0c:00.0: ring 14(vcn_enc0) uses VM inv eng 5 on hub 1
[    2.316737] amdgpu 0000:0c:00.0: ring 15(vcn_enc1) uses VM inv eng 6 on hub 1
[    2.316737] amdgpu 0000:0c:00.0: ring 16(vcn_jpeg) uses VM inv eng 7 on hub 1
[    2.316923] [drm] Initialized amdgpu 3.34.0 20150101 for 0000:0c:00.0 on minor 0
[28830.279521] amdgpu 0000:0c:00.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=none:owns=io+mem

So if someone can give me any hint regarding a possible solution or something to try out I would be very thankful :-) Power isn't that cheap in Germany ;-)

Comment 1 Robert 2019-08-26 23:07:36 UTC

Not sure if it's of any use but I figured out today that after starting KDE Plasma, launching "Konsole" and typing "sensors" the output is basically garbage:

"""
amdgpu-pci-0c00
Adapter: PCI adapter
vddgfx:       +0.72 V  
fan1:             N/A  (min =    0 RPM, max = 3200 RPM)
edge:             N/A  (crit = +118.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
junction:         N/A  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
mem:              N/A  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
power1:           N/A  (cap = 180.00 W)

asus-isa-0000
Adapter: ISA adapter
cpu_fan:        0 RPM
"""

I can repeat this a few times and it stays the same. And I always see these errors in "dmesg" or "journalctl":

"""
[  137.931148] amdgpu: [powerplay] failed send message: TransferTableSmu2Dram (18)      param: 0x00000006 response 0xffffffc2
[  137.931150] amdgpu: [powerplay] Failed to export SMU metrics table!
[  140.144885] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14)      param: 0x00000080 response 0xffffffc2
[  142.358346] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14)      param: 0x00000080 response 0xffffffc2
[  142.358348] amdgpu: [powerplay] Failed to export SMU metrics table!
[  144.571878] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14)      param: 0x00000080 response 0xffffffc2
[  146.785069] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14)      param: 0x00000080 response 0xffffffc2
[  146.785071] amdgpu: [powerplay] Failed to export SMU metrics table!
[  148.998450] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14)      param: 0x00000080 response 0xffffffc2
[  151.211737] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14)      param: 0x00000080 response 0xffffffc2
[  151.211738] amdgpu: [powerplay] Failed to export SMU metrics table!
[  153.425132] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14)      param: 0x00000080 response 0xffffffc2
[  155.638843] amdgpu: [powerplay] failed send message: SetDriverDramAddrHigh (14)      param: 0x00000080 response 0xffffffc2
[  155.638845] amdgpu: [powerplay] Failed to export SMU metrics table!
"""

It looks like "sensors" triggers one such "failed send message..." error for every value it tries to read.
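To check that impression, the failed SMU messages can be tallied by message name. This is only a sketch: the function name count_smu_failures is made up for illustration, and it assumes the log lines keep the "failed send message: <Name>" layout shown above.

```shell
# Count "failed send message" kernel log lines per SMU message name.
# Reads log text on stdin and prints "<MessageName> <count>", sorted by name.
count_smu_failures() {
    awk '/failed send message:/ {
        for (i = 1; i <= NF; i++)
            if ($i == "message:") { n[$(i + 1)]++; break }
    }
    END { for (m in n) print m, n[m] }' | sort
}

# Hypothetical usage (reading dmesg may need root on some systems):
# dmesg | count_smu_failures
```

Comparing the counts before and after a single "sensors" run would show how many messages one invocation produces.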

Now the funny thing is if I start "Firefox" the screen "flickers" very shortly and afterwards "sensors" prints useful values e.g.:

"""
amdgpu-pci-0c00
Adapter: PCI adapter
vddgfx:       +0.72 V  
fan1:         531 RPM  (min =    0 RPM, max = 3200 RPM)
edge:         +54.0°C  (crit = +118.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
junction:     +56.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
mem:          +66.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
power1:       34.00 W  (cap = 180.00 W)

asus-isa-0000
Adapter: ISA adapter
cpu_fan:        0 RPM
"""

But the problem with high idle power consumption of course doesn't change. Today I updated to the latest firmware from 2019-08-26 and also updated Mesa to 19.2-rc1. In the last post I forgot to mention that I'm currently using "libdrm-git 2.4.99.r16.g14922551-1" which is basically libdrm master branch AFAIK.

I'm really a little bit out of ideas ATM. Besides the idle power consumption thingy everything is working perfectly. Even Minecraft ;-)

Before I installed Archlinux from scratch I used a Nvidia GTX 1060 with the Nvidia binary drivers in the same host, as the Sapphire card I now use hadn't been released at that time. With that card I didn't have any issues with idle power consumption. It was around 8-10W while running KDE Plasma.

Comment 2 Andrew Sheldon 2019-08-27 04:47:44 UTC

I have the same problem, but with the MSI Evoke 5700 XT. If you read /sys/class/drm/card0/device/pp_dpm_mclk you should find that it's forced to the highest state (3: 875Mhz) and that although it lets you set a lower value, it immediately jumps back to the maximum value.
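For anyone who wants to check this quickly, the selected state is the line marked with "*" in the pp_dpm_mclk listing. A small sketch that extracts it; the helper name active_dpm_state is hypothetical and "card0" is an assumption that may differ on multi-GPU systems.

```shell
# Print the currently selected state from a pp_dpm_* listing read on stdin.
# The active entry is the line that ends with "*".
active_dpm_state() {
    awk '/\*[ \t]*$/ {
        sub(/[ \t]*\*[ \t]*$/, "")
        print
    }'
}

# Hypothetical usage:
# active_dpm_state < /sys/class/drm/card0/device/pp_dpm_mclk
```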

In theory, this problem should have been fixed with b90053edc9d6d639ddb600f8799d990d92aca328 in amd-staging-drm-next:
drm/amd/display: Support uclk switching for DCN2

but it doesn't seem to fix the problem for me. Before this, you could revert the old workaround:
02316e963a5a ("drm/amd/display: Force uclk to max for every state")

and you could manually set mclk.

I should note that from some brief tests on Windows, the card also seems to be stuck at maximum mclk (it's actually even worse since temperature readings don't even work there). So it could be that aftermarket cards need some extra work in order to work properly.

System:
Mesa git
amd-staging-drm-next (also tested 5.3-rcX and drm-next-5.4)

Comment 3 Robert 2019-08-27 07:25:02 UTC

Thanks Andrew for your comment! At least now I know that I'm not alone ;-) The funny thing is that one of the users in the Archlinux forum thread (https://bbs.archlinux.org/viewtopic.php?pid=1860353#p1860353) mentions that for him it is working with Gnome 3 under Wayland. He also has a Sapphire Pulse RX 5700 XT.

Maybe it's a combination of chipset + graphics card? I have an Asus ROG STRIX X570-E GAMING board so it has a X570 chipset.

Comment 4 Robert 2019-08-27 22:20:35 UTC

I guess it's also not of interest but if I pull the DisplayPort cable and plug it in again I get this warning via "dmesg" (this happens every time I do this):

"""
[Wed Aug 28 00:12:08 2019] ------------[ cut here ]------------
[Wed Aug 28 00:12:08 2019] WARNING: CPU: 6 PID: 1995 at drivers/gpu/drm/amd/amdgpu/../display/dc/dcn20/dcn20_resource.c:2800 dcn20_validate_bandwidth.cold+0xe/0x18 [amdgpu]
[Wed Aug 28 00:12:08 2019] Modules linked in: xt_nat veth xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat br_netfilter overlay wireguard(O) ip6_udp_tunnel udp_tunnel edac_mce_amd kvm_amd sr_mod cdrom uas usb_storage stv6110x lnbp21 kvm btusb btrtl snd_hda_codec_realtek btbcm btintel nls_iso8859_1 snd_hda_codec_generic nls_cp437 ledtrig_audio snd_hda_codec_hdmi vfat fat stv090x bluetooth snd_hda_intel crct10dif_pclmul crc32_pclmul snd_hda_codec ghash_clmulni_intel bridge snd_hda_core eeepc_wmi asus_wmi snd_hwdep sparse_keymap ngene stp ecdh_generic snd_pcm dvb_core aesni_intel ecc videobuf2_vmalloc joydev snd_timer ccp videobuf2_memops llc video wmi_bmof mxm_wmi videobuf2_common sp5100_tco mousedev evdev aes_x86_64 input_leds snd crypto_simd led_class mac_hid cryptd glue_helper i2c_piix4 rfkill pcspkr rng_core soundcore videodev igb mc dca wmi button acpi_cpufreq nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt nf_log_ipv4 nf_log_common ipt_REJECT nf_reject_ipv4 xt_LOG
[Wed Aug 28 00:12:08 2019]  xt_multiport xt_limit xt_addrtype xt_tcpudp xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nfsd nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter bpfilter auth_rpcgss nfs_acl lockd grace sch_fq_codel sunrpc sg crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 hid_logitech_hidpp sd_mod hid_logitech_dj hid_generic usbhid hid crc32c_intel ahci xhci_pci libahci xhci_hcd libata usbcore nvme scsi_mod usb_common nvme_core amdgpu amd_iommu_v2 gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm agpgart vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio
[Wed Aug 28 00:12:08 2019] CPU: 6 PID: 1995 Comm: Xorg Tainted: G        W  O      5.3.0-rc3-amd-staging-drm-next-git-b8cd95e15410 #1
[Wed Aug 28 00:12:08 2019] Hardware name: System manufacturer System Product Name/ROG STRIX X570-E GAMING, BIOS 1005 08/01/2019
[Wed Aug 28 00:12:08 2019] RIP: 0010:dcn20_validate_bandwidth.cold+0xe/0x18 [amdgpu]
[Wed Aug 28 00:12:08 2019] Code: d9 05 ef e0 18 00 8b 54 24 08 0f b7 44 24 2e 80 cc 0c 66 89 44 24 2c e9 83 ed f4 ff 48 c7 c7 50 49 80 c0 31 c0 e8 dd 8b 9b cb <0f> 0b 45 89 f5 e9 5e f3 f4 ff 48 c7 c7 50 49 80 c0 31 c0 e8 c5 8b
[Wed Aug 28 00:12:08 2019] RSP: 0018:ffff9e5e09b43a98 EFLAGS: 00010246
[Wed Aug 28 00:12:08 2019] RAX: 0000000000000024 RBX: 4079400000000000 RCX: 0000000000000000
[Wed Aug 28 00:12:08 2019] RDX: 0000000000000000 RSI: ffff8e847e997448 RDI: ffff8e847e997448
[Wed Aug 28 00:12:08 2019] RBP: ffff8e8337650000 R08: ffff8e847e997448 R09: 0000000000000004
[Wed Aug 28 00:12:08 2019] R10: 0000000000000000 R11: 0000000000000001 R12: ffff8e8470d00000
[Wed Aug 28 00:12:08 2019] R13: 0000000000000000 R14: 0000000000000001 R15: ffff8e84708e6000
[Wed Aug 28 00:12:08 2019] FS:  00007f1c3fdbfdc0(0000) GS:ffff8e847e980000(0000) knlGS:0000000000000000
[Wed Aug 28 00:12:08 2019] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Wed Aug 28 00:12:08 2019] CR2: 00007f4398fee528 CR3: 0000000ff4c52000 CR4: 0000000000340ee0
[Wed Aug 28 00:12:08 2019] Call Trace:
[Wed Aug 28 00:12:08 2019]  dc_validate_global_state+0x28a/0x310 [amdgpu]
[Wed Aug 28 00:12:08 2019]  amdgpu_dm_atomic_check+0x5a2/0x800 [amdgpu]
[Wed Aug 28 00:12:08 2019]  drm_atomic_check_only+0x550/0x780 [drm]
[Wed Aug 28 00:12:08 2019]  drm_atomic_commit+0x13/0x50 [drm]
[Wed Aug 28 00:12:08 2019]  drm_atomic_connector_commit_dpms+0xd7/0x100 [drm]
[Wed Aug 28 00:12:08 2019]  drm_mode_obj_set_property_ioctl+0x159/0x2b0 [drm]
[Wed Aug 28 00:12:08 2019]  ? drm_connector_set_obj_prop+0x90/0x90 [drm]
[Wed Aug 28 00:12:08 2019]  drm_connector_property_set_ioctl+0x39/0x60 [drm]
[Wed Aug 28 00:12:08 2019]  drm_ioctl_kernel+0xaa/0xf0 [drm]
[Wed Aug 28 00:12:08 2019]  drm_ioctl+0x208/0x390 [drm]
[Wed Aug 28 00:12:08 2019]  ? drm_connector_set_obj_prop+0x90/0x90 [drm]
[Wed Aug 28 00:12:08 2019]  ? ep_read_events_proc+0xd0/0xd0
[Wed Aug 28 00:12:08 2019]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[Wed Aug 28 00:12:08 2019]  do_vfs_ioctl+0x40c/0x670
[Wed Aug 28 00:12:08 2019]  ksys_ioctl+0x5e/0x90
[Wed Aug 28 00:12:08 2019]  __x64_sys_ioctl+0x16/0x20
[Wed Aug 28 00:12:08 2019]  do_syscall_64+0x4e/0x120
[Wed Aug 28 00:12:08 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Wed Aug 28 00:12:08 2019] RIP: 0033:0x7f1c411f221b
[Wed Aug 28 00:12:08 2019] Code: 0f 1e fa 48 8b 05 75 8c 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 45 8c 0c 00 f7 d8 64 89 01 48
[Wed Aug 28 00:12:08 2019] RSP: 002b:00007ffe5a751d68 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Wed Aug 28 00:12:08 2019] RAX: ffffffffffffffda RBX: 00007ffe5a751da0 RCX: 00007f1c411f221b
[Wed Aug 28 00:12:08 2019] RDX: 00007ffe5a751da0 RSI: 00000000c01064ab RDI: 000000000000000d
[Wed Aug 28 00:12:08 2019] RBP: 00000000c01064ab R08: 0000000000000000 R09: 000056432099ffb0
[Wed Aug 28 00:12:08 2019] R10: 0000000000000000 R11: 0000000000000246 R12: 000056431ed03f90
[Wed Aug 28 00:12:08 2019] R13: 000000000000000d R14: 00005643226d4d70 R15: 0000000000000000
[Wed Aug 28 00:12:08 2019] ---[ end trace 7f6319103d8b887e ]---
"""

Besides the error nothing else happens. Display is still working fine afterwards (besides the still high power consumption).

As Andrew already mentioned my card also stays at the highest frequency:

"""
cat /sys/class/drm/card0/device/pp_dpm_mclk
0: 100Mhz 
1: 500Mhz 
2: 625Mhz 
3: 875Mhz *
"""

Comment 5 Robert 2019-08-28 14:25:10 UTC

One additional observation I made yesterday: If I stop sddm/KDE via "systemctl stop sddm" the frequency has changed after I'm back in console:

"""
cat /sys/class/drm/card0/device/pp_dpm_mclk
0: 100Mhz 
1: 500Mhz *
2: 625Mhz 
3: 875Mhz 
"""

At this point I'm also able to even go to "100MHz" and power consumption goes down at least to 13W. If I run "systemctl start sddm" again I only see the KDE Plasma start logo and that's it. After going back to the console and stopping sddm again, the frequency is again at "875MHz" and can't be changed anymore. In this case you have to reboot.

But I guess without some developer guidance/hints there isn't much I can do anymore.

Comment 6 Andrew Sheldon 2019-09-03 09:35:41 UTC

Okay, so in my case, it turned out to be a problem with >60hz refresh rates. If I set to 60hz, the problem goes away.

sensors:
amdgpu-pci-0d00
Adapter: PCI adapter
vddgfx:       +0.72 V
fan1:           0 RPM  (min =    0 RPM, max = 3200 RPM)
edge:         +53.0°C  (crit = +118.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
junction:     +53.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
mem:          +56.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
power1:       12.00 W  (cap = 200.00 W)

cat pp_dpm_mclk:

0: 100Mhz *
1: 500Mhz 
2: 625Mhz 
3: 875Mhz

This is a problem on Windows as well, so there looks to be a cross-platform bug here.

Also, much like Windows, 75hz is even more buggy, with lm-sensors triggering the weirdness relating to sensor data that some users have reported (N/A sensors readings, and then a lockup). Windows has a variation of this, with all sensors being unreadable when using a 75hz refresh rate (but no lockup at least).

My main refresh rate (92hz) doesn't have the latter problem, at least.

Comment 7 Robert 2019-09-03 16:54:39 UTC

Thanks Andrew for your comment! Sadly that doesn't apply to me. My 49WL95C-W is using 5120x1440 @60 Hz connected via DisplayPort, and the display doesn't support a refresh rate higher than 60 Hz.

Comment 8 Andrew Sheldon 2019-09-04 01:16:23 UTC

I just did some more tests, and in my case, it wasn't strictly the refresh rate, but the timings being too aggressive (which I needed to do to lower the pixel clock enough due to driver limits, which are quite conservative).

This wasn't as much of a problem with Vega, since the idle power usage was about the same (12-15W), but it is with Navi.

I will also add that during my tests, I found it was possible to leave the system in a state where I couldn't leave the high memclock/power usage situation after a while, even when switching to 60hz, requiring a reboot. So that might be what is happening on your system, Robert.

Comment 9 Robert 2019-09-04 22:16:57 UTC

Thanks Andrew, but I guess I don't know how to interpret your last comment ;-) Is there something I can test/change? I can't change the value of "/sys/class/drm/card0/device/pp_dpm_mclk" which is the memclock frequency AFAIK. It always stays at 875Mhz regardless which value I submit. As soon as I start KDE Plasma I open "konsole" and I see the 875Mhz. So the power usage is already high at this point and not after using KDE Plasma for a while. Executing

echo "2" > /sys/class/drm/card0/device/pp_dpm_mclk

e.g. doesn't change the frequency.

Maybe it's a "KDE Plasma + X server" thingy. Users who are using "KDE Plasma + Wayland" seem to be less frequently affected by the problem. But that's not really an option for me as I need screen sharing from time to time via Zoom video conferencing or Slack and AFAIK that still doesn't work with Wayland (besides other problems). It's somehow funny that Wayland seems to be less of an issue than X...
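One thing that may be worth ruling out: the amdgpu kernel documentation notes that writes to the pp_dpm_* files require power_dpm_force_performance_level to be set to "manual" first. A minimal sketch under that assumption; the function name set_mclk_state is hypothetical, the card0 path is an assumption, root is needed on a real system, and the driver may still force the maximum state regardless.

```shell
# Request a memory clock state through sysfs and read the listing back.
# $1: device directory (assumed: /sys/class/drm/card0/device)
# $2: state index as listed in pp_dpm_mclk
set_mclk_state() {
    dev=$1
    state=$2
    # pp_dpm_* writes are documented to need the "manual" performance level.
    echo manual > "$dev/power_dpm_force_performance_level" || return 1
    echo "$state" > "$dev/pp_dpm_mclk" || return 1
    # On the real device the "*" marker shows whether the request stuck.
    cat "$dev/pp_dpm_mclk"
}

# Hypothetical usage (as root):
# set_mclk_state /sys/class/drm/card0/device 2
```

If the "*" still sits on the 875Mhz line afterwards, the clock is being held by something other than the performance level setting.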

Maybe it's a X570 chipset + 5700XT thingy. I've no idea. I'm not a driver developer ;-) I can only try things out like settings or patches (if I get some). But I guess this is one of the issues that some future commit will maybe fix "by accident" or it will just stay there forever ;-) If I see how long other AMD related threads stay around in this bug tracker without solution or some solution after years I currently don't have much hope that there will be any solution for my problem. Hopefully Intel launches a dedicated graphics card sometimes next year. I never had notable issues with Intel hardware and Linux within the last few years. It just works ;-)

Comment 10 Andrew Sheldon 2019-09-05 01:01:55 UTC

>I don't know how to interpret your last comment 

Yeah, I was a bit unclear. I was just indicating that while I can work around the issue, it can still be triggered on my system as well. E.g. if I switch to 75hz, it will be stuck at 875mhz (even after switching back), so it's possible that the issue can be triggered through different ways (but the underlying issue may be the same).

Anyway, I suspect that this bug and the one related to sensor readings (including the 75hz issue) are related. It's most likely a video bios/firmware issue as it affects Windows as well, and some have even triggered the bug in BIOS settings, with monitors that use 75hz.

One thing you could try is booting with a window manager/DE that doesn't use any sort of hardware acceleration. That's the main difference I can figure between my system, and yours (besides the fact I use x370 instead of x570). I would also try a lower resolution just to test, as that's a pretty non-standard res, and might be another way of triggering this bug.

Comment 11 Andrew Sheldon 2019-09-05 02:20:05 UTC

One more thing to add: some users on Windows have had issues with X470/X570 PCIE4 support. The problem being that Navi advertises PCIE4 support, but doesn't actually support it properly yet, causing weird issues that potentially could result in your issue. If your BIOS has the option, try changing PCIE from Auto/4 to 3.

Comment 12 Robert 2019-09-05 14:21:54 UTC

Andrew, you're my hero ;-) While I'm even more sad now (because I now think that this issue will indeed never be fixed) I now at least can imagine what's going on.

As you recommended I changed resolution to 1920x1080. That's quite common I would say. And tata! Indeed "sensors" reported 8W. Then I changed to 3840x2160. Not so usual I guess but the 2nd biggest resolution my monitor supports. Still 8W! And then back to 5120x1440. And tata! 33W :-(
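As a rough cross-check of these three modes, the active pixel throughput (width x height x refresh) can be compared. This sketch assumes 60 Hz for every mode, which is not stated above for 3840x2160; the function name pixel_rate_mpx is made up for illustration, and blanking intervals are ignored.

```shell
# Active pixel throughput in megapixels per second.
# $1: width, $2: height, $3: refresh rate in Hz
pixel_rate_mpx() {
    LC_ALL=C awk -v w="$1" -v h="$2" -v hz="$3" \
        'BEGIN { printf "%.1f\n", w * h * hz / 1000000 }'
}

pixel_rate_mpx 1920 1080 60
pixel_rate_mpx 3840 2160 60
pixel_rate_mpx 5120 1440 60
```

Under that assumption 3840x2160 moves more pixels per second than 5120x1440, so raw pixel throughput alone would not seem to explain why only the widest mode keeps the memory clock at maximum.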

That could mean two things: 1) There is a bug somewhere (firmware, driver, ...) with this resolution which causes that high idle power consumption or 2) which is even worse, I suspect that with this resolution the 5700XT acts like you have plugged in two monitors. And from what I read throughout all Navi10 reviews, multi monitor setups and power consumption were and still are a problem with AMD graphics cards in general. Oh man, that's something I really didn't reckon with :-( Yeah, in that case I can really only hope for the Intel Xe graphics cards next year (if they really build a consumer card which nobody knows yet ;-) ).

I still don't understand why in console where KMS is enabled 8W is enough and while running KDE Plasma in that resolution it takes roughly 33W. But maybe it really has something to do with acceleration. It doesn't even consider reducing memory clock at least a little bit even if I do nothing and haven't even started any program. That's the funny thing in general: As long as I don't start Firefox, Thunderbird or something like that "sensors" doesn't even work. It just prints garbage values and takes minutes to complete. If I start one of the programs mentioned above the screen flickers very shortly and afterwards "sensors" works as expected. I guess only the AMD god knows what that means ;-)

At least I now know that I can't do anything further. It just would be cool if one of the AMD engineers could confirm my assumption that with a resolution of 5120x1440 the card always runs at highest memory clock speed as it does with a multi monitor setup (from what I've read so far).

Comment 13 Robert 2019-09-05 14:24:41 UTC

Ah and regarding the PCIe3/4 thingy: I can't change that in the BIOS. I didn't find any configuration that allows me to change it in general or for the PCIe slot in question. But I guess that's something I don't really need to try anymore anyways.

Comment 14 Ilia Mirkin 2019-09-05 14:53:50 UTC

(In reply to Robert from comment #12)
> At least I now know that I can't do anything further. It just would be cool
> if one of the AMD engineers could confirm my assumption that with a
> resolution of 5120x1440 the card always runs at highest memory clock speed
> as it does with a multi monitor setup (from what I've read so far).

[Note, I'm not an AMD engineer.]

In some monitors, such high modes are actually exposed by presenting multiple "tiles" as separate screens. As far as the GPU is concerned, it's 2 actual monitors (this can only work with DisplayPort, of course).

Can you check if this is the case? I believe "xrandr" should report 2 separate monitors in this situation. You mention you have a 49WL95C-W, which the internet suggests is indeed just 2 panels placed next to each other in a nice plastic case.

And in such cases, I believe the AMD drivers clock to the highest rate, since reclocks will cause flickering (since the vsync's of the 2 monitors aren't sync'd to one another).

Comment 15 Robert 2019-09-05 18:16:59 UTC

Thanks Ilia for your comment! I get this output from "xrandr":

"""
Screen 0: minimum 320 x 200, current 5120 x 1440, maximum 16384 x 16384
DisplayPort-0 disconnected (normal left inverted right x axis y axis)
DisplayPort-1 disconnected (normal left inverted right x axis y axis)
DisplayPort-2 connected primary 5120x1440+0+0 (normal left inverted right x axis y axis) 1200mm x 340mm
   5120x1440     60.00 +  30.00*+
   3840x1080     60.00 +
   3840x2160     60.00    30.00  
   1920x1200     60.00  
   1920x1080     60.00    59.94  
   1600x1200     60.00  
   1680x1050     60.00  
   1600x900      60.00  
   1280x1024     60.02  
   1440x900      60.00  
   1280x800      59.81  
   1152x864      59.97  
   1280x720      60.00    59.94  
   1024x768      60.00  
   800x600       60.32  
   720x480       60.00    59.94  
   640x480       60.00    59.94  
HDMI-A-0 disconnected (normal left inverted right x axis y axis)
"""

So from what I can see only one monitor is reported.

But I figured out something else: If I change the refresh rate from 60Hz to 30Hz I get 8W idle power consumption... Umpf... Now I've a big screen, kinda high end graphics card and 30Hz refresh rate :D It basically works but moving windows a little bit faster or moving the mouse pointer around looks "interesting". Haven't tested any games with that refresh rate but I guess it also looks "interesting" ;-)

Comment 16 Andrew Sheldon 2019-09-08 02:02:18 UTC

One possibility could be to create a custom modeline, perhaps trying refresh rates between 30-60hz (starting with 45hz), so you can find a point where the high idle power usage kicks in. Reduced blanking modes could be useful if it's a case of bandwidth.

See: https://github.com/kevinlekiller/cvt_modeline_calculator_12

Something like this, using just xrandr (-b option indicating reduced blanking v2 mode):

./cvt12 5120 1440 45 -b

Which yields:
Modeline "5120x1440_45.00_rb2"  344.21  5120 5128 5160 5200  1440 1457 1465 1471 +hsync -vsync

Then:
xrandr --output DisplayPort-2 --newmode "5120x1440_45.00_rb2" 344.21  5120 5128 5160 5200  1440 1457 1465 1471 +hsync -vsync

xrandr --output DisplayPort-2 --addmode DisplayPort-2 "5120x1440_45.00_rb2"

xrandr --output DisplayPort-2 --mode "5120x1440_45.00_rb2"
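To sanity-check a modeline before adding it, the refresh rate can be recomputed as pixel clock divided by (horizontal total x vertical total), where the totals are the fourth horizontal and fourth vertical numbers of the modeline. A minimal sketch; the function name modeline_refresh is hypothetical.

```shell
# Refresh rate in Hz implied by a modeline.
# $1: pixel clock in MHz, $2: horizontal total, $3: vertical total
modeline_refresh() {
    LC_ALL=C awk -v clk="$1" -v ht="$2" -v vt="$3" \
        'BEGIN { printf "%.2f\n", clk * 1000000 / (ht * vt) }'
}

# Values taken from the 45 Hz reduced-blanking modeline above.
modeline_refresh 344.21 5200 1471
```

For the modeline above this prints 45.00, matching the requested rate.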

Comment 17 Robert 2019-09-10 19:27:43 UTC

Thanks Andrew! I played around a little bit with the refresh rates. Between 40-60Hz there is no difference in idle power consumption. The mem clock stays at 875Mhz and can't be changed.

The best refresh rate with 8W idle power consumption I could get was at 39Hz:

cvt12 5120 1440 39 -b
xrandr --output DisplayPort-2 --newmode "5120x1440_39.00_rb2" 297.51 5120 5128 5160 5200 1440 1453 1461 1467 +hsync -vsync
xrandr --output DisplayPort-2 --addmode DisplayPort-2 "5120x1440_39.00_rb2"
xrandr --output DisplayPort-2 --mode "5120x1440_39.00_rb2"

This causes the mem clock to go up to 625Mhz at first but it can be switched back to 100Mhz with

echo "0" > /sys/class/drm/card0/device/pp_dpm_mclk

Regarding my statement when using 30Hz in the last comment:

"""
It basically works but moving windows a little bit faster or moving the mouse pointer around looks "interesting".
"""

For this "flickering" that I saw and which was quite annoying I found a workaround :-) It looked like something didn't refresh fast enough. So I thought playing around with some frequencies would be a good idea... And the mem clock was the obvious one to start with. So I was setting the mem clock to 500Mhz with

echo "1" > /sys/class/drm/card0/device/pp_dpm_mclk

Then the "flickering" went away :-) But of course that brought idle power consumption to 24W. So just for fun I switched back to 100Mhz with

echo "0" > /sys/class/drm/card0/device/pp_dpm_mclk

Funny enough the "flickering" stayed away :-))) So for now after I start KDE plasma I enter Konsole and execute

echo "1" > /sys/class/drm/card0/device/pp_dpm_mclk
echo "0" > /sys/class/drm/card0/device/pp_dpm_mclk

and be happy :D
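
The two echo commands can be wrapped in a small helper. This is just a sketch of the manual steps above, not a fix: the function name kick_mclk and the one-second pause are assumptions, the card0 path is the one from this report and may differ on other systems, and writing to it needs root. As far as I know the amdgpu documentation also expects power_dpm_force_performance_level to be set to "manual" before writing to pp_dpm_mclk; that step isn't part of the commands above, so it is left out here.

```shell
#!/bin/sh
# Assumed default path, taken from the commands above; override MCLK_NODE
# if the card is not card0.
MCLK_NODE=${MCLK_NODE:-/sys/class/drm/card0/device/pp_dpm_mclk}

# kick_mclk [PAUSE]: select memory clock level 1, wait PAUSE seconds
# (default 1, an arbitrary choice), then go back to level 0.
kick_mclk() {
    echo 1 > "$MCLK_NODE" || return 1
    sleep "${1:-1}"
    echo 0 > "$MCLK_NODE"
}
```

Run as root after Plasma has started, e.g. by sourcing the file and calling kick_mclk.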

One final observation: I tried out kernel 5.3-rc8. With that kernel there is no way to reduce idle power consumption. It stays at 34W regardless of what you do. But with this tag https://cgit.freedesktop.org/~agd5f/linux/tag/?h=drm-next-5.4-2019-08-30 (which basically is kernel 5.3-rc3 with the Navi10 patches for kernel 5.4 - if I got it right ;-) ) idle power consumption is as expected.

So my whole issue basically comes down to this: If you have a resolution of 5120x1440 and a refresh rate of > 39Hz your idle power consumption stays at max and there is (at least until now) nothing you can do about it. So if I had used a lower resolution or a smaller screen I wouldn't have had an issue at all ;-) S... happens :D

But anyways: Thanks so much for your help and also to Ilia! I'm now happy with my setup so far. It would be very interesting to know if there is really some kind of a cap at 5120x1440@39Hz or if this is "only" a firmware problem, a driver problem, a config error or something completely different. Maybe we'll find out in our next lives :D

Comment 18 Leon 2019-09-27 10:49:21 UTC

I have the same problem. Sapphire 5700 XT Nitro, x470 motherboard (asrock taichi), running arch with kernel 5.3.1. My resolution is 2560x1440 144Hz, with 30Watts idle and 70 Celsius at the memory :( ... Unlike you changing the refresh rate doesn't seem to improve anything though, and I don't have the same problem using windows 10.

Comment 19 Leon 2019-09-27 10:51:49 UTC

By the way, since I have a x470 mb, it cannot be related to PCI express 4.0. It's also not related to dual displays, since I'm running just one.

Comment 20 Andrew Sheldon 2019-09-28 07:38:08 UTC

(In reply to Leon from comment #19)
> By the way, since I have a x470 mb, it cannot be related to PCI express 4.0.
> It's also not related to dual displays, since I'm running just one.

Not necessarily. Some x370 and x470 motherboards erroneously reported PCIE 4.0 support in earlier BIOS (AGESA) updates. You might want to update to the latest available bios, if that is the case with your board.

Although, since you are using 5.3.1 I don't think the bug has been fixed in mainline 5.3 yet. You might want to use amd-staging-drm-next, or wait for 5.4-rc1.

Comment 21 Robert 2019-09-28 09:05:31 UTC

(In reply to Leon from comment #18)
> I have the same problem. Sapphire 5700 XT Nitro, x470 motherboard (asrock
> taichi), running arch with kernel 5.3.1. My resolution is 2560x1440 144Hz,
> with 30Watts idle and 70 Celsius at the memory :( ... Unlike you changing
> the refresh rate doesn't seem to improve anything though, and I don't have
> the same problem using windows 10.

You definitely need to either wait for kernel 5.4rc1 for the idle power consumption thingy to be fixed or if you use Archlinux you can use this solution for now: https://bbs.archlinux.org/viewtopic.php?pid=1865600#p1865600 Or you compile this kernel source on your own: https://cgit.freedesktop.org/~agd5f/linux/tag/?h=drm-next-5.4-2019-08-30

Kernel 5.3 has a fixed setting for idle power consumption. I haven't found any way to reduce idle power consumption with this kernel version.

Comment 22 Leon 2019-09-28 10:14:08 UTC

(In reply to Andrew Sheldon from comment #20)
> (In reply to Leon from comment #19)
> > By the way, since I have a x470 mb, it cannot be related to PCI express 4.0.
> > It's also not related to dual displays, since I'm running just one.
> 
> Not necessarily. Some x370 and x470 motherboards erroneously reported PCIE
> 4.0 support in earlier BIOS (AGESA) updates. You might want to update to the
> latest available bios, if that is the case with your board.
> 
> Although, since you are using 5.3.1 I don't think the bug has been fixed in
> mainline 5.3 yet. You might want to use amd-staging-drm-next, or wait for
> 5.4-rc1.

Already did even before installing the 5700 XT. Running bios P3.60 with AM4 1.0.0.3 ABB AGESA.

Thanks for the kernel suggestion though!

Comment 23 Andrew Sheldon 2019-10-05 11:55:29 UTC

@Leon

I suspect there is more than one bug occurring. The main Navi-specific issue (which affected everyone) has been fixed in newer kernels, but there is another issue relating to high resolution and high refresh rate monitors that looks to affect at least Navi and Vega (and probably Polaris, going by other reports).

The secondary issue is probably by design to an extent. High res/refresh rate requires a lot more bandwidth which needs a higher memory clock. However, I suspect there are two problems within this:

- Once a high bandwidth mode is used and the maximum memory clock is chosen, it never switches down again (even if you switch to a lower bandwidth mode). Particularly, if you boot at 2560x1440@144hz, you won't be able to switch down again.
- The choice of memory clock is higher than it needs to be, even for high bandwidth modes

You can work around this to some extent on Vega by writing to the powerplay tables (while in a high bandwidth mode) and, in the case of Vega, the card will stay in the more reasonable memory clock of 700mhz (versus the max of 950mhz). However, if you then switch to any other high bandwidth mode (e.g. 2560x1440@120), the problem will return (card stuck at 950mhz).

I don't recommend trying that on Navi, as powerplay table writing is currently buggy without reverting a commit. I haven't confirmed the behaviour there, but I suspect the same workaround will work.

Comment 24 Sylvain BERTRAND 2019-10-05 12:20:21 UTC

Hi, popping to say it may be the same on southern islands (tahiti xt).

Comment 25 Eduardo 2019-10-06 04:45:58 UTC

I have a PowerColor RedDevil 5700XT and for me, Kernel 5.4-rc1 just works. Memory clocks always at 100Mhz when idle, even using KDE (Plasma 5.16).

amdgpu-pci-0a00
Adapter: PCI adapter
vddgfx:       +0.72 V  
fan1:           0 RPM  (min =    0 RPM, max = 3500 RPM)
edge:         +43.0°C  (crit = +118.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
junction:     +43.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
mem:          +44.0°C  (crit = +99.0°C, hyst = -273.1°C)
                       (emerg = +99.0°C)
power1:       11.00 W  (cap = 220.00 W)

I'm using Display Port, with FreeSync ON, on a 75HZ monitor, with 2560x1080 resolution.

beast ~ # cat /sys/class/drm/card0/device/pp_dpm_mclk 
0: 100Mhz *
1: 500Mhz 
2: 625Mhz 
3: 875Mhz
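
For scripting, the active level can be picked out of that output by looking for the line marked with "*". A minimal sketch, assuming the "index: clock *" layout shown above; the function name active_mclk is made up for illustration.

```shell
#!/bin/sh
# active_mclk: read pp_dpm_mclk-style text on stdin and print the index and
# clock of the level whose last field is '*'.
active_mclk() {
    awk '$NF == "*" { sub(":", "", $1); print $1, $2 }'
}
```

Usage: active_mclk < /sys/class/drm/card0/device/pp_dpm_mclk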

Comment 26 Andrew Sheldon 2019-10-07 08:31:25 UTC

(In reply to Eduardo from comment #25) 
> I have a PowerColor RedDevil 5700XT and for me, Kernel 5.4-rc1 just works.
> Memory clocks always at 100Mhz when idle, even using KDE (Plasma 5.16).

> I'm using Display Port, with FreeSync ON, on a 75HZ monitor, with 2560x1080
> resolution.

It works for you because your resolution is below a certain threshold in
resolution and refresh rate, and not a multi-monitor setup.
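
To put rough numbers on that threshold, here is the active pixel rate (width x height x refresh) for the modes reported in this bug so far. This is only a back-of-the-envelope proxy that ignores blanking; it is an assumption that the driver's decision tracks it, and the function name pixel_rate is made up for illustration.

```shell
#!/bin/sh
# pixel_rate WIDTH HEIGHT HZ: active pixels per second, ignoring blanking.
pixel_rate() {
    echo $(( $1 * $2 * $3 ))
}

# Modes mentioned in this report so far.
for mode in "2560 1080 75" "5120 1440 39" "5120 1440 40" "5120 1440 60" "2560 1440 144"; do
    set -- $mode
    printf '%sx%s@%s: %s pixels/s\n' "$1" "$2" "$3" "$(pixel_rate "$1" "$2" "$3")"
done
```

The two modes reported to idle at the low memory clock (2560x1080@75 and 5120x1440@39) come out at roughly 207 and 288 million pixels/s. 5120x1440 at 40Hz and above (about 295 million and up) and 2560x1440@144 (about 531 million) are the ones reported as stuck at the maximum.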

Anyway, I think I have a clearer picture of things now.

Firstly, Navi does still have a few additional power consumption issues (as compared to Vega). One such issue is that if during the boot sequence the monitor switches to 144hz (the default on one of my monitors), then that's it: it's impossible to clock down from the maximum mclk at any point. One workaround is to boot with the monitor unplugged, then plug it in after. The switch seems to happen later in the boot sequence (but before X), so it's possible this can be worked around by changing framebuffer settings. Furthermore, 2560x1440@144hz in general forces the card to the maximum clock (even with the workaround, although you can switch down to a different mode at least), whereas Vega stays at a lower mclk.

Secondly, multi-monitor configurations will force the card to the maximum clock, by design (on all GPUs). You can work around this by setting both monitors to the same resolution/refresh rate, provided you have a newer kernel. However, this doesn't work with Navi. Another hacky workaround is to write data to the powerplay tables while in a dual-monitor setup with mismatched modes, but this is just a hack, and I can't promise stability (although it worked for me, again, only with Vega).

Thirdly, Navi uses a lot more power at idle compared to Vega, even when both are in the maximum mclk. E.g. Vega uses around 15W in a multi-monitor configuration (2560x1440@90 + 2560x1440@144), whereas Navi will use 36W for pretty much any configuration that hits the maximum mclk. It could be that HBM is more efficient, lower voltages, or even a reporting error (I haven't tested at the wall, yet).

So in short:
- Navi + 144hz at boot completely breaks mclk switching 
- Navi + 144hz uses unnecessarily high mclk (compared to Vega)
- Multi-monitor high mclk is by design (all GPUs)
- Navi uses a lot more power at idle than Vega, when at the same mclk

Comment 27 Andrew Sheldon 2019-10-07 10:05:26 UTC

A bit of a hacky workaround to 144hz (and multi-monitor issues) on Navi:

- Bootup to X
- Suspend to ram
- Notice that clocks have dropped (even in multi-monitor configuration)
- I get flickering in the auto profile after doing this (maybe similar to the Polaris issues)
- To remove the flickering, set power_dpm_force_performance_level to "low"

Works even in high res/high refresh rate scenarios, such as 2560x1440@90 + 2560x1440@144hz. I haven't extensively stability tested it mind you, but looks good so far.

Obviously, you'll want to set power_dpm_force_performance_level to "high" when playing a game.

Comment 28 Robert 2019-10-26 08:33:06 UTC

(In reply to Andrew Sheldon from comment #27)
> A bit of a hacky workaround to 144hz (and multi-monitor issues) on Navi:
> 

Thanks Andrew for this hack! That's really a joke. This indeed works with my Navi10 and a 5120x1440@60 Hz resolution. I normally can only use 30 Hz or max. 39 Hz if I want to stay at 8W in idle mode. If I use >40 Hz idle power consumption goes up to around 32W. But with your trick I can even use 5120x1440@60 Hz with 8W idle power consumption. :-)

I'm not really into this hardware/driver stuff but I guess this proves that there is a bug somewhere. Either in the firmware, in the driver or maybe even in Mesa or so.

Is someone aware of another place where something like that can be reported? I mean I would really try out everything to help developers nail this down, but it doesn't look like there are any AMD developers around here. Ok, maybe AMD just doesn't care about bugs at all but hope is the last thing that dies, right? ;-)

Comment 29 Shmerl 2019-11-01 03:01:14 UTC

(In reply to Andrew Sheldon from comment #27)
> A bit of a hacky workaround to 144hz (and multi-monitor issues) on Navi:
> 
> - Bootup to X
> - Suspend to ram
> - Notice that clocks have dropped (even in multi-monitor configuration)
> - I get flickering in the auto profile after doing this (maybe similar to
> the Polaris issues)
> - To remove the flickering, set power_dpm_force_performance_level to "low"
> 

Which one exactly did you set it at?

I have 2560x1440 / 144 Hz monitor (LG 27GL850) and Sapphire Pulse RX 5700 XT (hardware switch set to higher performance BIOS) and in general I noticed a similar thing. During normal idle KDE operation, power stays at around 32 W or so.

If I suspend and resume, power drops to 11 W and the monitor starts flickering wildly. I tried to do:

echo "low" > /sys/class/drm/card0/device/power_dpm_force_performance_level

But that didn't really help with flickering.

Can anyone from AMD please comment, whether 32 W is expected power consumption for light desktop usage at 2560x1440 / 144 Hz for Sapphire Pulse RX 5700 XT?

And clearly, after resume things are now broken regardless of what the normal level is supposed to be.

Comment 30 Andrew Sheldon 2019-11-02 04:22:18 UTC

(In reply to Shmerl from comment #29)

> Which one exactly did you set it at?
> 
> I have 2560x1440 / 144 Hz monitor (LG 27GL850) and Sapphire Pulse RX 5700 XT
> (hardware switch set to higher performance BIOS) and in general I noticed a
> similar thing. During normal idle KDE operation, power stays at around 32 W
> or so.
> 
> If I suspend and resume, power drops to 11 W and the monitor starts
> flickering wildly. I tried to do:
> 
> echo "low" > /sys/class/drm/card0/device/power_dpm_force_performance_level
> 
> But that didn't really help with flickering.

You may have to first set it to "high", then back to "low".

It will also stop working once a fix that adds "smu->disable_uclk_switch = 0;" in amdgpu_smu.c filters down to the mainline kernels.
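
The "high, then low" sequence can be scripted along these lines. It is a sketch only: the function names and the one-second pause are assumptions, the card0 path is the one used in the comments above and may differ on other systems, and writing to it needs root.

```shell
#!/bin/sh
# Assumed default path, as used in the comments above; override PERF_NODE
# if the card is not card0.
PERF_NODE=${PERF_NODE:-/sys/class/drm/card0/device/power_dpm_force_performance_level}

# set_perf_level LEVEL: write one of the levels used in this thread.
set_perf_level() {
    case $1 in
        auto|low|high) echo "$1" > "$PERF_NODE" ;;
        *) echo "unsupported level: $1" >&2; return 1 ;;
    esac
}

# settle_low [PAUSE]: force "high" first, wait PAUSE seconds (default 1,
# an arbitrary choice), then force "low".
settle_low() {
    set_perf_level high && sleep "${1:-1}" && set_perf_level low
}
```

After a suspend/resume cycle, calling settle_low as root would replay the two writes; set_perf_level high can be used before starting a game.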

Comment 31 Shmerl 2019-11-03 16:55:09 UTC

I can confirm, that at 2560x1440 / 144 Hz, after suspend / resume, setting "high" in /sys/class/drm/card0/device/power_dpm_force_performance_level stops flickering that starts after resume, and then setting "low" there still keeps it flickering free, while dropping MCLK and power consumption to what you expect from a normal idle level! 

You can check that with:

    sudo cat /sys/kernel/debug/dri/0/amdgpu_pm_info

So it means the card can handle it after all, but somehow doesn't dynamically adjust to that state.

Can anyone from AMD please comment on this situation?

Comment 32 Dieter Nützel 2019-11-04 16:31:26 UTC

(In reply to Shmerl from comment #31)
> I can confirm, that at 2560x1440 / 144 Hz, after suspend / resume, setting
> "high" in /sys/class/drm/card0/device/power_dpm_force_performance_level
> stops flickering that starts after resume, and then setting "low" there
> still keeps it flickering free, while dropping MCLK and power consumption to
> what you expect from a normal idle level! 
> 
> You can check that with:
> 
>     sudo cat /sys/kernel/debug/dri/0/amdgpu_pm_info
> 
> So it means the card can handle it after all, but somehow doesn't
> dynamically adjust to that state.
> 
> Can anyone from AMD please comment on this situation?

Hello 'Shmerl',

can you (and the others) please recheck with 'auto', too?
I think we have the 'same' problem with Polaris, too.
If one sets low/high, the clock is held at a fixed frequency and is NOT in the 'flickering' 'auto' mode.

I can't test the 'suspend / resume' cycle because my server/workstation does NOT work reliably with it.
But power consumption is definitely too high on Polaris, too.

@Alex: What do you think about this?

Comment 33 Shmerl 2019-11-04 16:42:56 UTC

(In reply to Dieter Nützel from comment #32)
> 
> Hello 'Shmerl',
> 
> can you (and the other) please recheck with 'auto', too?
> I think we have the 'same' problem with Polaris, too.

Can you clarify please, what scenario exactly do you want me to test? When computer boots (or resumes), the value is "auto" by default. On "auto" after boot, idle power consumption is high (30+W). After resume, with that "auto" value, the screen starts flickering, I'll check what power consumption it has at that point a bit later.

Comment 34 Dieter Nützel 2019-11-04 23:48:19 UTC

(In reply to Shmerl from comment #33)
> (In reply to Dieter Nützel from comment #32)
> > 
> > Hello 'Shmerl',
> > 
> > can you (and the other) please recheck with 'auto', too?
> > I think we have the 'same' problem with Polaris, too.
> 
> Can you clarify please, what scenario exactly do you want me to test? When
> computer boots (or resumes), the value is "auto" by default.

Expected.

> On "auto" after
> boot, idle power consumption is high (30+W).

Too high.
With 1 or more identical monitors?
Compare Alex's latest >=2 identical monitor patches.
I get the same even on Polaris (Alex?).

> After resume, with that "auto"
> value, the screen starts flickering,

That is currently the expected 'auto' clk transition bug.
If someone sets the clks to low/high, flickering can't happen.

> I'll check what power consumption it
> has at that point a bit later.

That's the 'new' interesting part that Andrew, Robert and you found with Navi:
the suspend / resume (with later high/low) 'cycle', which led to much lower power consumption.

I couldn't verify that on my Polaris system (with 2 identical HDMI monitors) so far.
With 'low' and 2 identical HDMI displays I get the below under 'amd-staging-drm-next':

GFX Clocks and Power:
        300 MHz (MCLK)
        300 MHz (SCLK)
        600 MHz (PSTATE_SCLK)
        1000 MHz (PSTATE_MCLK)
        750 mV (VDDGFX)
        32.174 W (average GPU)

PSTATE_SCLK and PSTATE_MCLK do NOT drop, and the power draw is much too high.
NO flickering due to 'low'.

I'll point Alex to this thread.

Comment 35 Andrew Sheldon 2019-11-05 00:37:33 UTC

(In reply to Dieter Nützel from comment #34)

> Which I couldn't verify on my Polaris system (with 2 identical HDMI
> monitors) currently.
> With 'low' and 2 identical HDMI displays I get the below under
> 'amd-staging-drm-next':
> 
> GFX Clocks and Power:
>         300 MHz (MCLK)
>         300 MHz (SCLK)
>         600 MHz (PSTATE_SCLK)
>         1000 MHz (PSTATE_MCLK)
>         750 mV (VDDGFX)
>         32.174 W (average GPU)
> 
> PSTATE_SCLK and PSTATE_MCLK do NOT drop and much to high W.
> NO flickering due to 'low'.
> 
> I'll point Alex to this thread.

You need to revert f6505e375fe8 , which "fixed" the flickering bug (but prevents the lower power consumption behaviour).

Comment 36 Shmerl 2019-11-05 00:43:46 UTC

OK, I recorded some data after different steps.

Sapphire Pulse RX 5700 XT, LG 27GL850-B (2560x1440, 144 Hz), DisplayPort 1.4 connection enabled in the monitor, DP 1.4 cable used.

1. After normal boot:

auto

GFX Clocks and Power:
        875 MHz (MCLK)
        800 MHz (SCLK)
        300 MHz (PSTATE_SCLK)
        100 MHz (PSTATE_MCLK)
        750 mV (VDDGFX)
        30.0 W (average GPU)
        
no flickering.

2. Doing:

echo "high" > /sys/class/drm/card0/device/power_dpm_force_performance_level

GFX Clocks and Power:
        875 MHz (MCLK)
        2045 MHz (SCLK)
        300 MHz (PSTATE_SCLK)
        100 MHz (PSTATE_MCLK)
        1200 mV (VDDGFX)
        33.0 W (average GPU)        

no flickering

3. Doing:

echo "low" > /sys/class/drm/card0/device/power_dpm_force_performance_level

GFX Clocks and Power:
        875 MHz (MCLK)
        300 MHz (SCLK)
        300 MHz (PSTATE_SCLK)
        100 MHz (PSTATE_MCLK)
        750 mV (VDDGFX)
        30.0 W (average GPU)

no flickering

4. Then, before suspending:

echo "auto" > /sys/class/drm/card0/device/power_dpm_force_performance_level

5. Suspend, and resume:

auto

GFX Clocks and Power:
        100 MHz (MCLK)
        800 MHz (SCLK)
        300 MHz (PSTATE_SCLK)
        100 MHz (PSTATE_MCLK)
        750 mV (VDDGFX)
        10.0 W (average GPU)

strong flickering

6. After that doing:

echo 'high' > power_dpm_force_performance_level

GFX Clocks and Power:
        875 MHz (MCLK)
        2045 MHz (SCLK)
        300 MHz (PSTATE_SCLK)
        100 MHz (PSTATE_MCLK)
        1200 mV (VDDGFX)
        33.0 W (average GPU)

no flickering

7. Doing:

echo 'low' > power_dpm_force_performance_level

GFX Clocks and Power:
        100 MHz (MCLK)
        300 MHz (SCLK)
        300 MHz (PSTATE_SCLK)
        100 MHz (PSTATE_MCLK)
        750 mV (VDDGFX)
        10.0 W (average GPU)

no flickering!

8. And then:

echo 'auto' > power_dpm_force_performance_level

GFX Clocks and Power:
        100 MHz (MCLK)
        800 MHz (SCLK)
        300 MHz (PSTATE_SCLK)
        100 MHz (PSTATE_MCLK)
        750 mV (VDDGFX)
        10.0 W (average GPU)
        
flickering again!
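
To collect readings like the ones above without copying them by hand, the two interesting values can be pulled out of amdgpu_pm_info. A rough sketch, assuming the "GFX Clocks and Power" layout shown in the steps above (other kernels may print it differently); the function name pm_summary is made up for illustration.

```shell
#!/bin/sh
# pm_summary: read amdgpu_pm_info-style text on stdin and print the current
# memory clock and the average GPU power on one line.
pm_summary() {
    awk '
        $2 == "MHz" && $3 == "(MCLK)"   { mclk = $1 }
        $2 == "W"   && $3 == "(average" { power = $1 }
        END { printf "MCLK=%sMHz power=%sW\n", mclk, power }
    '
}
```

Usage: sudo cat /sys/kernel/debug/dri/0/amdgpu_pm_info | pm_summary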

Comment 37 Shmerl 2019-11-05 00:50:39 UTC

If that makes any difference, I enabled adaptive sync for amdgpu, and didn't revert any commits. Using regular 5.4-rc6 kernel.

Comment 38 Andrew Sheldon 2019-11-05 05:57:31 UTC

(In reply to Shmerl from comment #37)
> If that makes any difference, I enabled adaptive sync for amdgpu, and didn't
> revert any commits. Using regular 5.4-rc6 kernel.

The fix isn't in 5.4-rcX yet, so no need to revert anything there.

Comment 39 Shmerl 2019-11-08 01:35:24 UTC

With kernel 5.4-rc6 I'm now seeing such errors once in 20 minutes or so:

[37947.927301] WARNING: CPU: 5 PID: 992 at drivers/gpu/drm/amd/amdgpu/../display/dc/dcn20/dcn20_resource.c:2806 dcn20_validate_bandwidth+0xc0/0xd0 [amdgpu]
[37947.927301] Modules linked in: snd_seq_dummy(E) snd_seq(E) macvtap(E) macvlan(E) tap(E) xt_CHECKSUM(E) ipt_REJECT(E) nf_reject_ipv4(E) xt_tcpudp(E) nft_compat(E) nft_counter(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) tun(E) bridge(E) stp(E) llc(E) rfcomm(E) nf_tables(E) nfnetlink(E) bnep(E) edac_mce_amd(E) kvm_amd(E) kvm(E) irqbypass(E) crct10dif_pclmul(E) crc32_pclmul(E) ghash_clmulni_intel(E) btusb(E) snd_hda_codec_realtek(E) btrtl(E) btbcm(E) snd_hda_codec_generic(E) btintel(E) ledtrig_audio(E) iwlmvm(E) snd_hda_codec_hdmi(E) bluetooth(E) uvcvideo(E) snd_hda_intel(E) mac80211(E) videobuf2_vmalloc(E) snd_intel_nhlt(E) videobuf2_memops(E) libarc4(E) snd_usb_audio(E) videobuf2_v4l2(E) nls_ascii(E) snd_hda_codec(E) efi_pstore(E) videobuf2_common(E) snd_usbmidi_lib(E) nls_cp437(E) snd_rawmidi(E) aesni_intel(E) snd_hda_core(E) snd_seq_device(E) vfat(E) snd_hwdep(E) videodev(E) crypto_simd(E) drbg(E) fat(E) cryptd(E) iwlwifi(E) snd_pcm(E) mc(E)
[37947.927323]  glue_helper(E) ansi_cprng(E) wmi_bmof(E) efivars(E) pcspkr(E) sp5100_tco(E) snd_timer(E) ecdh_generic(E) ecc(E) ccp(E) watchdog(E) snd(E) k10temp(E) crc16(E) soundcore(E) sg(E) rng_core(E) cfg80211(E) rfkill(E) evdev(E) acpi_cpufreq(E) nct6775(E) hwmon_vid(E) parport_pc(E) ppdev(E) lp(E) parport(E) efivarfs(E) ip_tables(E) x_tables(E) autofs4(E) xfs(E) btrfs(E) xor(E) zstd_decompress(E) zstd_compress(E) raid6_pq(E) libcrc32c(E) crc32c_generic(E) sd_mod(E) hid_generic(E) usbhid(E) hid(E) amdgpu(E) gpu_sched(E) ttm(E) drm_kms_helper(E) ahci(E) mxm_wmi(E) libahci(E) drm(E) crc32c_intel(E) xhci_pci(E) libata(E) xhci_hcd(E) i2c_piix4(E) mfd_core(E) igb(E) scsi_mod(E) dca(E) usbcore(E) ptp(E) pps_core(E) i2c_algo_bit(E) nvme(E) nvme_core(E) wmi(E) button(E)
[37947.927347] CPU: 5 PID: 992 Comm: Xorg Tainted: G        W   E     5.4.0-rc6+ #29
[37947.927348] Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./X570 Taichi, BIOS P2.11 09/25/2019
[37947.927424] RIP: 0010:dcn20_validate_bandwidth+0xc0/0xd0 [amdgpu]
[37947.927426] Code: 5d 41 5c 41 5d e9 d0 fc ff ff f2 0f 11 85 70 21 00 00 31 d2 48 89 ee 4c 89 e7 e8 bb fc ff ff 41 89 c5 22 85 c8 1d 00 00 75 04 <0f> 0b eb 92 c6 85 c8 1d 00 00 00 41 89 c5 eb 86 0f 1f 44 00 00 41
[37947.927427] RSP: 0018:ffffbf3b41e67ad0 EFLAGS: 00010246
[37947.927428] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000017795
[37947.927429] RDX: 0000000000017794 RSI: ffff9e69fe96db40 RDI: 000000000002db40
[37947.927429] RBP: ffff9e695ce90000 R08: 0000000000000006 R09: 0000000000000000
[37947.927430] R10: ffff9e69eb8f0000 R11: 0000000100000001 R12: ffff9e69eb8f0000
[37947.927431] R13: 0000000000000001 R14: 0000000000000000 R15: ffff9e695ce90000
[37947.927432] FS:  00007f4774b06f00(0000) GS:ffff9e69fe940000(0000) knlGS:0000000000000000
[37947.927433] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[37947.927434] CR2: 00007f46bc3c5200 CR3: 00000007dc9ba000 CR4: 0000000000340ee0
[37947.927434] Call Trace:
[37947.927508]  dc_validate_global_state+0x25f/0x2d0 [amdgpu]
[37947.927581]  amdgpu_dm_atomic_check+0x5a1/0x7e0 [amdgpu]
[37947.927597]  drm_atomic_check_only+0x554/0x7e0 [drm]
[37947.927611]  ? drm_connector_list_iter_next+0x7d/0x90 [drm]
[37947.927622]  drm_atomic_commit+0x13/0x50 [drm]
[37947.927634]  drm_atomic_connector_commit_dpms+0xd7/0x100 [drm]
[37947.927648]  drm_mode_obj_set_property_ioctl+0x159/0x2b0 [drm]
[37947.927661]  ? drm_connector_set_obj_prop+0x90/0x90 [drm]
[37947.927671]  drm_connector_property_set_ioctl+0x39/0x60 [drm]
[37947.927681]  drm_ioctl_kernel+0xaa/0xf0 [drm]
[37947.927691]  drm_ioctl+0x208/0x390 [drm]
[37947.927702]  ? drm_connector_set_obj_prop+0x90/0x90 [drm]
[37947.927750]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[37947.927754]  do_vfs_ioctl+0x40e/0x670
[37947.927757]  ? do_setitimer+0xde/0x230
[37947.927759]  ksys_ioctl+0x5e/0x90
[37947.927761]  __x64_sys_ioctl+0x16/0x20
[37947.927763]  do_syscall_64+0x52/0x160
[37947.927766]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[37947.927768] RIP: 0033:0x7f477504e5d7
[37947.927769] Code: 00 00 90 48 8b 05 b9 78 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 89 78 0c 00 f7 d8 64 89 01 48
[37947.927770] RSP: 002b:00007ffcbf170a38 EFLAGS: 00003246 ORIG_RAX: 0000000000000010
[37947.927771] RAX: ffffffffffffffda RBX: 00007ffcbf170a70 RCX: 00007f477504e5d7
[37947.927772] RDX: 00007ffcbf170a70 RSI: 00000000c01064ab RDI: 000000000000000d
[37947.927773] RBP: 00000000c01064ab R08: 0000000000000000 R09: 00007f477471ad10
[37947.927773] R10: 00007f477471ad20 R11: 0000000000003246 R12: 000055962a2ad220
[37947.927774] R13: 000000000000000d R14: 0000559627814780 R15: 0000000000000000

Comment 40 Shmerl 2019-11-08 01:36:36 UTC

To correct, it's 5.4-rc6 plus these patches:
https://cgit.freedesktop.org/~agd5f/linux/diff/?h=drm-fixes-5.4-2019-11-06&id=2c409ba81be25516afe05ae27a4a15da01740b01&id2=a99d8080aaf358d5d23581244e5da23b35e340b9

Comment 41 Robert 2019-11-08 06:36:31 UTC

(In reply to Shmerl from comment #39)
> With kernel 5.4-rc6 I'm now seeing such errors once in 20 minutes or so:
> 

I don't see it that often but I'm also getting it from time to time. I don't use any patches. It's plain 5.4rc6. But I can't see any obvious consequences.

[Fri Nov  8 07:22:49 2019] WARNING: CPU: 22 PID: 2129 at drivers/gpu/drm/amd/amdgpu/../display/dc/dcn20/dcn20_resource.c:2801 dcn20_validate_bandwidth+0xc0/0xd0 [amdgpu]
[Fri Nov  8 07:22:49 2019] Modules linked in: msr ngene dm_mod vhost_net vhost tap tun fuse xt_nat veth xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat br_netfilter overlay wireguard(OE) ip6_udp_tunnel udp_tunnel ebtable_filter ebtables edac_mce_amd kvm_amd snd_usb_audio snd_usbmidi_lib snd_rawmidi snd_seq_device nct6775(OE) hwmon_vid nls_iso8859_1 nls_cp437 vfat fat stv6110x eeepc_wmi lnbp21 asus_wmi battery sparse_keymap wmi_bmof mxm_wmi kvm snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_codec_hdmi crct10dif_pclmul crc32_pclmul snd_hda_intel ghash_clmulni_intel snd_intel_nhlt snd_hda_codec bridge snd_hda_core btusb btrtl btbcm snd_hwdep stp btintel llc joydev aesni_intel mousedev crypto_simd bluetooth stv090x input_leds cryptd snd_pcm glue_helper snd_timer pcspkr igb k10temp ecdh_generic snd ccp rfkill sp5100_tco ecc rng_core i2c_piix4 soundcore dca dvb_core pinctrl_amd evdev mac_hid wmi acpi_cpufreq nf_log_ipv6 ip6t_REJECT nf_reject_ipv6 xt_hl ip6t_rt
[Fri Nov  8 07:22:49 2019]  nf_log_ipv4 nf_log_common ipt_REJECT nf_reject_ipv4 xt_LOG xt_multiport xt_limit xt_addrtype xt_tcpudp xt_conntrack ip6table_filter ip6_tables nf_conntrack_netbios_ns nf_conntrack_broadcast nf_nat_ftp nf_nat nf_conntrack_ftp nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c iptable_filter nfsd uvcvideo videobuf2_vmalloc videobuf2_memops auth_rpcgss videobuf2_v4l2 videobuf2_common nfs_acl videodev lockd grace mc sunrpc sg crypto_user ip_tables x_tables ext4 crc32c_generic crc16 mbcache jbd2 hid_logitech_hidpp hid_logitech_dj hid_generic usbhid hid sd_mod ahci libahci libata crc32c_intel xhci_pci scsi_mod xhci_hcd amdgpu gpu_sched i2c_algo_bit ttm drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm agpgart vfio_pci irqbypass vfio_virqfd vfio_iommu_type1 vfio [last unloaded: ngene]
[Fri Nov  8 07:22:49 2019] CPU: 22 PID: 2129 Comm: Xorg Tainted: G        W  OE     5.4.0-rc6-mainline #1
[Fri Nov  8 07:22:49 2019] Hardware name: System manufacturer System Product Name/ROG STRIX X570-E GAMING, BIOS 1201 09/09/2019
[Fri Nov  8 07:22:49 2019] RIP: 0010:dcn20_validate_bandwidth+0xc0/0xd0 [amdgpu]
[Fri Nov  8 07:22:49 2019] Code: 5d 41 5c 41 5d e9 a0 fc ff ff f2 0f 11 85 70 21 00 00 31 d2 48 89 ee 4c 89 e7 e8 8b fc ff ff 41 89 c5 22 85 c8 1d 00 00 75 04 <0f> 0b eb 92 c6 85 c8 1d 00 00 00 41 89 c5 eb 86 0f 1f 44 00 00 41
[Fri Nov  8 07:22:49 2019] RSP: 0018:ffff959b49adbaa0 EFLAGS: 00010246
[Fri Nov  8 07:22:49 2019] RAX: 0000000000000000 RBX: ffff93672e822bf8 RCX: 000000000374a816
[Fri Nov  8 07:22:49 2019] RDX: 000000000374a616 RSI: ffff93673edaf1a0 RDI: 000000000002f1a0
[Fri Nov  8 07:22:49 2019] RBP: ffff93658ccc0000 R08: 0000000000000006 R09: 0000000000000000
[Fri Nov  8 07:22:49 2019] R10: 0000000000000001 R11: 0000000100000001 R12: ffff93672faf0000
[Fri Nov  8 07:22:49 2019] R13: 0000000000000001 R14: 0000000000000000 R15: ffff93672ef01400
[Fri Nov  8 07:22:49 2019] FS:  00007f43fa01adc0(0000) GS:ffff93673ed80000(0000) knlGS:0000000000000000
[Fri Nov  8 07:22:49 2019] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[Fri Nov  8 07:22:49 2019] CR2: 00007f6c55528008 CR3: 0000000fad9c0000 CR4: 0000000000340ee0
[Fri Nov  8 07:22:49 2019] Call Trace:
[Fri Nov  8 07:22:49 2019]  dc_validate_global_state+0x28a/0x310 [amdgpu]
[Fri Nov  8 07:22:49 2019]  ? drm_modeset_lock+0x31/0xb0 [drm]
[Fri Nov  8 07:22:49 2019]  amdgpu_dm_atomic_check+0x5a2/0x800 [amdgpu]
[Fri Nov  8 07:22:49 2019]  drm_atomic_check_only+0x578/0x800 [drm]
[Fri Nov  8 07:22:49 2019]  ? _raw_spin_unlock_irqrestore+0x20/0x40
[Fri Nov  8 07:22:49 2019]  drm_atomic_commit+0x13/0x50 [drm]
[Fri Nov  8 07:22:49 2019]  drm_atomic_connector_commit_dpms+0xd7/0x100 [drm]
[Fri Nov  8 07:22:49 2019]  drm_mode_obj_set_property_ioctl+0x169/0x2c0 [drm]
[Fri Nov  8 07:22:49 2019]  ? drm_connector_set_obj_prop+0x90/0x90 [drm]
[Fri Nov  8 07:22:49 2019]  drm_connector_property_set_ioctl+0x41/0x60 [drm]
[Fri Nov  8 07:22:49 2019]  drm_ioctl_kernel+0xb2/0x100 [drm]
[Fri Nov  8 07:22:49 2019]  drm_ioctl+0x209/0x360 [drm]
[Fri Nov  8 07:22:49 2019]  ? drm_connector_set_obj_prop+0x90/0x90 [drm]
[Fri Nov  8 07:22:49 2019]  amdgpu_drm_ioctl+0x49/0x80 [amdgpu]
[Fri Nov  8 07:22:49 2019]  do_vfs_ioctl+0x43d/0x6c0
[Fri Nov  8 07:22:49 2019]  ksys_ioctl+0x5e/0x90
[Fri Nov  8 07:22:49 2019]  __x64_sys_ioctl+0x16/0x20
[Fri Nov  8 07:22:49 2019]  do_syscall_64+0x5b/0x1a0
[Fri Nov  8 07:22:49 2019]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[Fri Nov  8 07:22:49 2019] RIP: 0033:0x7f43fb26425b
[Fri Nov  8 07:22:49 2019] Code: 0f 1e fa 48 8b 05 25 9c 0c 00 64 c7 00 26 00 00 00 48 c7 c0 ff ff ff ff c3 66 0f 1f 44 00 00 f3 0f 1e fa b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d f5 9b 0c 00 f7 d8 64 89 01 48
[Fri Nov  8 07:22:49 2019] RSP: 002b:00007fffa2ba75d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[Fri Nov  8 07:22:49 2019] RAX: ffffffffffffffda RBX: 00007fffa2ba7610 RCX: 00007f43fb26425b
[Fri Nov  8 07:22:49 2019] RDX: 00007fffa2ba7610 RSI: 00000000c01064ab RDI: 000000000000000d
[Fri Nov  8 07:22:49 2019] RBP: 00000000c01064ab R08: 0000000000000000 R09: 00007f43fb372d10
[Fri Nov  8 07:22:49 2019] R10: 00007f43fb372d20 R11: 0000000000000246 R12: 000055984f983b90
[Fri Nov  8 07:22:49 2019] R13: 000000000000000d R14: 0000000000000000 R15: 0000000000000000
[Fri Nov  8 07:22:49 2019] ---[ end trace 838cf1460840b9b2 ]---
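
To see how often this warning fires, the header lines can be filtered out of the kernel log. A minimal sketch, assuming the WARNING line format shown in the two traces above; the function name bandwidth_warnings is made up for illustration.

```shell
#!/bin/sh
# bandwidth_warnings: read kernel log text on stdin and print only the
# WARNING header lines raised from dcn20_validate_bandwidth.
bandwidth_warnings() {
    grep 'WARNING:.*dcn20_validate_bandwidth'
}
```

For example, "dmesg -T | bandwidth_warnings" lists the occurrences with wall-clock timestamps, and appending "| grep -c ." counts them.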

Comment 42 Martin Peres 2019-11-19 09:50:33 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/893.
