Bug 93217

Summary: [tonga] [powerplay] Radeon M395X isn't initialised with the powerplay branch
Product: DRI Reporter: Mike Lothian <mike>
Component: DRM/AMDgpu Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED QA Contact:
Severity: normal    
Priority: medium CC: edward.ocallaghan, mike
Version: DRI git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description                           Flags
dmesg powerplay                       none
dmesg tip                             none
xorg.log tip                          none
xorg.log powerplay                    none
Updated dmesg with drm.debug=0xf      none
add debugging output                  none
Dmesg with patch applied              none
Ooops on shutdown                     none
Dmesg with rebased powerplay branch   none
Dmesg ignoring table                  none
disable pcie dpm                      none
Divide error                          none
Powerplay working                     none
Screenshot of divide by zero error    none
disable pcie gen3 switching           none
Latest dmesg                          none

Description Mike Lothian 2015-12-02 18:36:32 UTC
Created attachment 120284 [details]
dmesg powerplay

The card doesn't initialise:

amdgpu 0000:01:00.0: Fatal error during GPU init
[TTM] Memory type 2 has not been initialized
amdgpu: probe of 0000:01:00.0 failed with error -1

This is a hybrid graphics system with Skylake.

I'll attach the dmesg from both the powerplay branch and Linus's tree.

This is with runpm=0.

00:00.0 Host bridge [0600]: Intel Corporation Sky Lake Host Bridge/DRAM Registers [8086:1910] (rev 07)
00:01.0 PCI bridge [0604]: Intel Corporation Sky Lake PCIe Controller (x16) [8086:1901] (rev 07)
00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:191b] (rev 06)
00:04.0 Signal processing controller [1180]: Intel Corporation Device [8086:1903] (rev 07)
00:14.0 USB controller [0c03]: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller [8086:a12f] (rev 31)
00:14.2 Signal processing controller [1180]: Intel Corporation Sunrise Point-H Thermal subsystem [8086:a131] (rev 31)
00:16.0 Communication controller [0780]: Intel Corporation Sunrise Point-H CSME HECI #1 [8086:a13a] (rev 31)
00:17.0 SATA controller [0106]: Intel Corporation Sunrise Point-H SATA Controller [AHCI mode] [8086:a103] (rev 31)
00:1c.0 PCI bridge [0604]: Intel Corporation Sunrise Point-H PCI Express Root Port #1 [8086:a110] (rev f1)
00:1c.4 PCI bridge [0604]: Intel Corporation Sunrise Point-H PCI Express Root Port #5 [8086:a114] (rev f1)
00:1c.5 PCI bridge [0604]: Intel Corporation Sunrise Point-H PCI Express Root Port #6 [8086:a115] (rev f1)
00:1c.6 PCI bridge [0604]: Intel Corporation Sunrise Point-H PCI Express Root Port #7 [8086:a116] (rev f1)
00:1d.0 PCI bridge [0604]: Intel Corporation Sunrise Point-H PCI Express Root Port #9 [8086:a118] (rev f1)
00:1f.0 ISA bridge [0601]: Intel Corporation Sunrise Point-H LPC Controller [8086:a14e] (rev 31)
00:1f.2 Memory controller [0580]: Intel Corporation Sunrise Point-H PMC [8086:a121] (rev 31)
00:1f.3 Audio device [0403]: Intel Corporation Sunrise Point-H HD Audio [8086:a170] (rev 31)
00:1f.4 SMBus [0c05]: Intel Corporation Sunrise Point-H SMBus [8086:a123] (rev 31)
01:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Amethyst XT [Radeon R9 M295X] [1002:6921]
3b:00.0 Ethernet controller [0200]: Qualcomm Atheros Device [1969:e0a1] (rev 10)
3c:00.0 Network controller [0280]: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter [168c:003e] (rev 32)
3d:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS5227 PCI Express Card Reader [10ec:5227] (rev 01)
3e:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd Device [144d:a802] (rev 01)
Comment 1 Mike Lothian 2015-12-02 18:36:58 UTC
Created attachment 120285 [details]
dmesg tip
Comment 2 Mike Lothian 2015-12-02 18:37:27 UTC
Created attachment 120286 [details]
xorg.log tip
Comment 3 Mike Lothian 2015-12-02 18:37:55 UTC
Created attachment 120287 [details]
xorg.log powerplay
Comment 4 Mike Lothian 2015-12-05 02:15:49 UTC
I'd just like to add that booting with amdgpu.powerplay=0 allows the card to be initialized
Comment 5 Mike Lothian 2015-12-07 19:33:41 UTC
Created attachment 120392 [details]
Updated dmesg with drm.debug=0xf
Comment 6 Alex Deucher 2015-12-07 22:05:03 UTC
Created attachment 120397 [details] [review]
add debugging output

Please apply this patch and attach the output.
Comment 7 Mike Lothian 2015-12-07 22:11:36 UTC
Created attachment 120399 [details]
Dmesg with patch applied
Comment 8 Mike Lothian 2015-12-11 13:02:20 UTC
Created attachment 120464 [details]
Ooops on shutdown

This is what's produced on shutdown on Linus's tree (0bd0f1e6d40aa16c4d507b1fff27163a7e7711f5). I'm not sure if it's related.
Comment 9 Alex Deucher 2015-12-11 17:43:34 UTC
Please try the latest powerplay branch and attach the dmesg log.  It will print some additional debugging info for the failures.
Comment 10 Mike Lothian 2015-12-11 18:37:00 UTC
Created attachment 120466 [details]
Dmesg with rebased powerplay branch

Seems "init_thermal_controller failed" is the problem
Comment 11 Alex Deucher 2015-12-14 15:51:47 UTC
See if the latest patches in my powerplay branch help.
Comment 12 Mike Lothian 2015-12-14 19:01:32 UTC
Created attachment 120501 [details]
Dmesg ignoring table

This is strange - it looks as though initialisation gets a lot further but then stops when it "Failed to send Previous Message" and "unforce pcie level failed!"

The loading of the kernel felt longer than before and after 30 seconds the whole machine locked up

I managed to save this dmesg just in time after several attempts
Comment 13 Alex Deucher 2015-12-14 21:05:03 UTC
Created attachment 120503 [details] [review]
disable pcie dpm

Does this allow it to start?
Comment 14 Mike Lothian 2015-12-14 22:43:36 UTC
Created attachment 120504 [details]
Divide error

It still boots slowly and freezes just after the disks mount and before X starts. I don't get the "Last Message" messages anymore. I tried this a few times and got the above divide error.
Comment 15 Alex Deucher 2015-12-14 23:45:40 UTC
Try my updated powerplay branch both with and without the disable_pcie_dpm patch.
Comment 16 Mike Lothian 2015-12-15 00:34:43 UTC
Created attachment 120507 [details]
Powerplay working

I still have to set pcie_dpm_key_disabled = 1 

I had to turn off pid cgroups for some reason

I also had to turn off runpm; it seems the card is initialised the first time, but when it's powered back up it has a hissy fit

I tested Metro 2033 Redux so I know I was definitely running on Tonga, but I locked up the system by setting everything to max

Progress :D
Comment 17 Mike Lothian 2015-12-15 00:36:56 UTC
The card is also reported by lm_sensors - someone asked about this on IRC the other day

amdgpu-pci-0100
Adapter: PCI adapter
temp1:        +59.0°C  (crit =  +0.0°C, hyst =  +0.0°C)
Comment 18 Mike Lothian 2015-12-19 03:11:18 UTC
Created attachment 120589 [details]
Screenshot of divide by zero error

I've just retried your latest powerplay branch

I still have to set pcie_dpm_key_disabled = 1 to prevent the message errors

However, I now get another divide by zero issue and the kernel doesn't start

I'm attaching a screenshot
Comment 19 Mike Lothian 2015-12-20 14:37:07 UTC
The commit "amd/powerplay: don't enable ucode fan control if vbios has no fan table" allows my machine to boot again - thanks
Comment 20 Alex Deucher 2015-12-21 21:25:04 UTC
Created attachment 120644 [details] [review]
disable pcie gen3 switching

Does this patch allow you to use pcie dpm?
Comment 21 Mike Lothian 2015-12-21 21:35:43 UTC
Nope, I still get the "message" errors
Comment 22 Mike Lothian 2016-01-21 14:42:35 UTC
On both drm-next and Linus's tree, with powerplay enabled and pcie dpm disabled, runpm appears to be working; however, the performance doesn't seem as good as before

I'm not convinced powerplay is enabled
Comment 23 Mike Lothian 2016-01-21 14:43:22 UTC
Created attachment 121181 [details]
Latest dmesg
Comment 24 Alex Deucher 2016-01-21 19:43:50 UTC
Powerplay is still disabled by default in drm-next and Linus's tree.  Enable it by passing amdgpu.powerplay=1 on the kernel command line in grub.
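
A minimal sketch for checking which amdgpu options a given kernel command line actually carries. The helper name `amdgpu_params` and the example command line are hypothetical; only the `amdgpu.powerplay` and `amdgpu.runpm` options come from this thread.

```python
def amdgpu_params(cmdline):
    """Return the amdgpu.* options found in a kernel command-line string."""
    params = {}
    for token in cmdline.split():
        # Only look at tokens of the form amdgpu.<name>=<value>.
        if token.startswith("amdgpu.") and "=" in token:
            key, value = token[len("amdgpu."):].split("=", 1)
            params[key] = value
    return params


# Hypothetical command line, for illustration only.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 amdgpu.powerplay=1 amdgpu.runpm=0"
print(amdgpu_params(example))  # {'powerplay': '1', 'runpm': '0'}
```

On a running system the string would normally come from /proc/cmdline rather than a literal.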
Comment 25 Mike Lothian 2016-01-22 00:31:59 UTC
As you can see from the dmesg - I've enabled it

It seems to be a runpm problem. With amdgpu.runpm=1 the chip now reinitialises when DRI_PRIME=1 is passed, but it seems the speeds don't ramp up and there's a warning about thermal values. With amdgpu.runpm=0 everything still seems to work, but of course the card doesn't switch off when not in use
Comment 27 Mike Lothian 2016-01-27 23:42:30 UTC
I've tried your latest 4.6-wip branch. I still have to disable pcie_dpm to get the kernel to boot, and I have to disable runpm to get performance

Let me know if you'd like me to test any of the new sysfs knobs you've added
Comment 28 Mike Lothian 2016-01-28 00:31:05 UTC
0: 2.5GB, x8 
1: 8.0GB, x16 *
2: 8.0GB, x16 
3: 8.0GB, x16 
4: 8.0GB, x16 
5: 8.0GB, x16 
6: 8.0GB, x16 
./devices/pci0000:00/0000:00:01.0/0000:01:00.0/pp_dpm_pcie (END)
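
A small sketch of how the listing above could be read programmatically, assuming the line format stays exactly as shown (level index, speed, lane width, and a trailing `*` on the currently selected level). The function names and the shortened sample are illustrative, not part of the driver.

```python
import re

# Shortened sample in the same format as the pp_dpm_pcie listing above.
SAMPLE = """\
0: 2.5GB, x8
1: 8.0GB, x16 *
2: 8.0GB, x16
"""

# "<level>: <speed>, x<lanes>" with an optional trailing "*".
LINE_RE = re.compile(r"^(\d+):\s*(\S+),\s*x(\d+)\s*(\*)?\s*$")


def parse_pp_dpm_pcie(text):
    """Parse the listing into a list of dicts, skipping unrelated lines."""
    levels = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match is None:
            continue
        levels.append({
            "level": int(match.group(1)),
            "speed": match.group(2),
            "lanes": int(match.group(3)),
            "active": match.group(4) is not None,
        })
    return levels


def active_level(levels):
    """Return the entry marked with "*", or None if no level is marked."""
    for entry in levels:
        if entry["active"]:
            return entry
    return None


levels = parse_pp_dpm_pcie(SAMPLE)
print(active_level(levels))
```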
Comment 29 Mike Lothian 2016-02-06 05:14:27 UTC
Would you like me to try re-enabling pcie dpm and use the new kernel parameters in drm-next-4.6-wip?
Comment 30 Alex Deucher 2016-02-22 17:20:58 UTC
(In reply to Mike Lothian from comment #29)
> Would you like me to try re-enabling pcie dpm and use the new kernel
> parameters in drm-next-4.6-wip?

Yes, can you see what combinations of pcie_gen_cap and pcie_lane_cap help?  See amd_pcie.h.

pcie_gen_cap:
bits 31:16 define the gen speeds supported by the platform (e.g., the motherboard).  Setting it to CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3 (0x00040000) would indicate that the motherboard only supports gen3 (not gen 2 or 1).
bits 15:0 define the pcie speeds supported by the GPU itself.  Setting this to CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN3 (0x00000004) means the asic only supports gen 3.

pcie_lane_cap:
bits 31:16 define the link width supported by the platform.
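
The bit layout described above can be sketched as follows. This is only an illustration of combining and splitting the two 16-bit halves; the two constants and their values are the ones quoted in this comment, while the helper names `make_gen_cap` and `split_gen_cap` are hypothetical.

```python
# Values quoted in the comment above (see amd_pcie.h).
CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3 = 0x00040000       # platform side, bits 31:16
CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN3 = 0x00000004  # GPU side, bits 15:0


def make_gen_cap(platform_bits, asic_bits):
    """Combine platform (upper half) and ASIC (lower half) speed masks."""
    return (platform_bits & 0xFFFF0000) | (asic_bits & 0x0000FFFF)


def split_gen_cap(value):
    """Split a 32-bit pcie_gen_cap value into (platform_bits, asic_bits)."""
    return value & 0xFFFF0000, value & 0x0000FFFF


# Platform and GPU both limited to gen3 only.
gen3_only = make_gen_cap(CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3,
                         CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN3)
print(hex(gen3_only))  # 0x40004
```

Assuming the parameter is set like the other amdgpu options in this thread, the resulting value would be passed as pcie_gen_cap; the exact combinations worth testing are not given here.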
Comment 31 Mike Lothian 2016-02-22 22:54:41 UTC
Sorry, I'm a wee bit confused about what you'd like me to do
Comment 32 Mike Lothian 2016-04-15 17:51:21 UTC
I've just tried booting without using the PCIe DPM patch and it now seems to work

Not sure if it was a commit or the new firmware - either way this works for me now
Comment 33 Alex Deucher 2016-05-31 18:37:38 UTC
What model laptop is this?
Comment 34 Mike Lothian 2016-05-31 18:39:28 UTC
Alienware 15 R2
