Created attachment 120284 [details] dmesg powerplay

The card doesn't initialise:

amdgpu 0000:01:00.0: Fatal error during GPU init
[TTM] Memory type 2 has not been initialized
amdgpu: probe of 0000:01:00.0 failed with error -1

This is a hybrid system with Skylake. I'll attach the dmesg with the powerplay branch and Linus's tree. This is with runpm=0.

00:00.0 Host bridge [0600]: Intel Corporation Sky Lake Host Bridge/DRAM Registers [8086:1910] (rev 07)
00:01.0 PCI bridge [0604]: Intel Corporation Sky Lake PCIe Controller (x16) [8086:1901] (rev 07)
00:02.0 VGA compatible controller [0300]: Intel Corporation Device [8086:191b] (rev 06)
00:04.0 Signal processing controller [1180]: Intel Corporation Device [8086:1903] (rev 07)
00:14.0 USB controller [0c03]: Intel Corporation Sunrise Point-H USB 3.0 xHCI Controller [8086:a12f] (rev 31)
00:14.2 Signal processing controller [1180]: Intel Corporation Sunrise Point-H Thermal subsystem [8086:a131] (rev 31)
00:16.0 Communication controller [0780]: Intel Corporation Sunrise Point-H CSME HECI #1 [8086:a13a] (rev 31)
00:17.0 SATA controller [0106]: Intel Corporation Sunrise Point-H SATA Controller [AHCI mode] [8086:a103] (rev 31)
00:1c.0 PCI bridge [0604]: Intel Corporation Sunrise Point-H PCI Express Root Port #1 [8086:a110] (rev f1)
00:1c.4 PCI bridge [0604]: Intel Corporation Sunrise Point-H PCI Express Root Port #5 [8086:a114] (rev f1)
00:1c.5 PCI bridge [0604]: Intel Corporation Sunrise Point-H PCI Express Root Port #6 [8086:a115] (rev f1)
00:1c.6 PCI bridge [0604]: Intel Corporation Sunrise Point-H PCI Express Root Port #7 [8086:a116] (rev f1)
00:1d.0 PCI bridge [0604]: Intel Corporation Sunrise Point-H PCI Express Root Port #9 [8086:a118] (rev f1)
00:1f.0 ISA bridge [0601]: Intel Corporation Sunrise Point-H LPC Controller [8086:a14e] (rev 31)
00:1f.2 Memory controller [0580]: Intel Corporation Sunrise Point-H PMC [8086:a121] (rev 31)
00:1f.3 Audio device [0403]: Intel Corporation Sunrise Point-H HD Audio [8086:a170] (rev 31)
00:1f.4 SMBus [0c05]: Intel Corporation Sunrise Point-H SMBus [8086:a123] (rev 31)
01:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Amethyst XT [Radeon R9 M295X] [1002:6921]
3b:00.0 Ethernet controller [0200]: Qualcomm Atheros Device [1969:e0a1] (rev 10)
3c:00.0 Network controller [0280]: Qualcomm Atheros QCA6174 802.11ac Wireless Network Adapter [168c:003e] (rev 32)
3d:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS5227 PCI Express Card Reader [10ec:5227] (rev 01)
3e:00.0 Non-Volatile memory controller [0108]: Samsung Electronics Co Ltd Device [144d:a802] (rev 01)
Created attachment 120285 [details] dmesg tip
Created attachment 120286 [details] xorg.log tip
Created attachment 120287 [details] xorg.log powerplay
I'd just like to add that booting with amdgpu.powerplay=0 allows the card to be initialized.
Created attachment 120392 [details] Updated dmesg with drm.debug=0xf
Created attachment 120397 [details] [review] add debugging output

Please apply this patch and attach the output.
Created attachment 120399 [details] Dmesg with patch applied
Created attachment 120464 [details] Ooops on shutdown

This is what's produced on shutdown on Linus's tree (0bd0f1e6d40aa16c4d507b1fff27163a7e7711f5). I'm not sure if it's related.
Please try the latest powerplay branch and attach the dmesg log. It will print some additional debugging info for the failures.
Created attachment 120466 [details] Dmesg with rebased powerplay branch

Seems "init_thermal_controller failed" is the problem.
See if the latest patches in my powerplay branch help.
Created attachment 120501 [details] Dmesg ignoring table

This is strange - it looks as though initialisation gets a lot further but then stops when it "Failed to send Previous Message" and "unforce pcie level failed!"

Loading the kernel felt slower than before, and after 30 seconds the whole machine locked up. I managed to save this dmesg just in time after several attempts.
Created attachment 120503 [details] [review] disable pcie dpm

Does this allow it to start?
Created attachment 120504 [details] Divide error

It was still booting slowly and freezing just after the disks mount and before X starts. I don't get the "Last Message" messages anymore. I tried this a few times and got the above divide error.
Try my updated powerplay branch both with and without the disable_pcie_dpm patch.
Created attachment 120507 [details] Powerplay working

I still have to set pcie_dpm_key_disabled = 1. I had to turn off pid cgroups for some reason. I also had to turn off runpm; it seems the card is initialised the first time, then has a hissy fit when it's powered back up.

I tested Metro 2033 Redux, so I know I was definitely running on Tonga, but I locked up the system setting everything to max.

Progress :D
The card is also reported by lm_sensors - someone asked about this on IRC the other day:

amdgpu-pci-0100
Adapter: PCI adapter
temp1: +59.0°C (crit = +0.0°C, hyst = +0.0°C)
Created attachment 120589 [details] Screenshot of divide by zero error

I've just retried your latest powerplay branch. I still have to set pcie_dpm_key_disabled = 1 to prevent the message errors. However, I now get another divide by zero issue and the kernel doesn't start. I'm attaching a screenshot.
The commit "amd/powerplay: don't enable ucode fan control if vbios has no fan table" allows my machine to boot again - thanks
Created attachment 120644 [details] [review] disable pcie gen3 switching

Does this patch allow you to use pcie dpm?
Nope, I still get the "message" errors
On both drm-next and Linus's tree, with powerplay enabled and pcie dpm disabled, runpm appears to be working. However, the performance doesn't seem as good as before, and I'm not convinced powerplay is enabled.
Created attachment 121181 [details] Latest dmesg
Powerplay is still disabled by default in drm-next and linus tree. Enable it by passing amdgpu.powerplay=1 on the kernel command line in grub.
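To double-check which amdgpu options the running kernel was actually booted with, something like the following could be used. This is only a minimal sketch: the helper name `amdgpu_params` and the example command line are made up for illustration, and it only looks at the command line string, not at what the driver ended up doing.

```python
def amdgpu_params(cmdline):
    """Collect amdgpu.* options from a kernel command line string."""
    params = {}
    for token in cmdline.split():
        # Only tokens of the form amdgpu.<name>=<value> are of interest here.
        if token.startswith("amdgpu.") and "=" in token:
            key, value = token[len("amdgpu."):].split("=", 1)
            params[key] = value
    return params


# Hypothetical command line, not taken from the attached dmesg.
example = "BOOT_IMAGE=/vmlinuz root=/dev/sda2 ro amdgpu.powerplay=1 amdgpu.runpm=0"
print(amdgpu_params(example))
```

On a running system the string can be read from /proc/cmdline instead of the hard-coded example.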
As you can see from the dmesg, I've enabled it.

It seems to be a runpm problem: with amdgpu.runpm=1 the chip now reinitialises when DRI_PRIME=1 is passed, but it seems the speeds don't ramp up and there's a warning about thermal values. With amdgpu.runpm=0 everything still seems to work, but of course the card doesn't switch off when not in use.
I've tried your latest 4.6-wip branch. I still have to disable pcie_dpm to get the kernel to boot, and I have to disable runpm to get performance.

Let me know if you'd like me to test any of the new sysfs knobs you've added.
0: 2.5GB, x8
1: 8.0GB, x16 *
2: 8.0GB, x16
3: 8.0GB, x16
4: 8.0GB, x16
5: 8.0GB, x16
6: 8.0GB, x16
./devices/pci0000:00/0000:00:01.0/0000:01:00.0/pp_dpm_pcie
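For anyone wanting to compare this output across kernels, the listing can be parsed mechanically. The following is a rough sketch that assumes the format is exactly as pasted above (index, speed, lane width, and a trailing `*` on the currently selected level); the function names are hypothetical.

```python
import re

# One level per line, e.g. "1: 8.0GB, x16 *"; the "*" marks the active level.
LEVEL_RE = re.compile(r"^(\d+):\s*([\d.]+)GB,\s*x(\d+)\s*(\*)?\s*$")


def parse_pp_dpm_pcie(text):
    """Return a list of dicts, one per PCIe DPM level found in the text."""
    levels = []
    for line in text.splitlines():
        match = LEVEL_RE.match(line.strip())
        if match is None:
            continue  # skip anything that is not a level line
        levels.append({
            "index": int(match.group(1)),
            "speed": float(match.group(2)),
            "lanes": int(match.group(3)),
            "active": match.group(4) is not None,
        })
    return levels


def active_level(levels):
    """Return the level marked with "*", or None if no level is marked."""
    for level in levels:
        if level["active"]:
            return level
    return None


sample = """0: 2.5GB, x8
1: 8.0GB, x16 *
2: 8.0GB, x16
3: 8.0GB, x16
4: 8.0GB, x16
5: 8.0GB, x16
6: 8.0GB, x16"""

levels = parse_pp_dpm_pcie(sample)
print(active_level(levels))
```

With the output above, this reports level 1 (8.0GB, x16) as the active one.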
Would you like me to try re-enabling pcie dpm and use the new kernel parameters in drm-next-4.6-wip?
(In reply to Mike Lothian from comment #29)
> Would you like me to try re-enabling pcie dpm and use the new kernel
> parameters in drm-next-4.6-wip?

Yes, can you see what combinations of pcie_gen_cap and pcie_lane_cap help? See amd_pcie.h.

pcie_gen_cap:
bits 31:16 define the gen speeds supported by the platform (e.g., the motherboard). Setting it to CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3 (0x00040000) would indicate that the motherboard only supports gen3 (not gen 2 or 1).
bits 15:0 define the pcie speeds supported by the GPU itself. Setting this to CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN3 (0x00000004) means the asic only supports gen 3.

pcie_lane_cap:
bits 31:16 define the link width supported by the platform.
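As an illustration of the bit layout described above, here is a small sketch that builds and splits a pcie_gen_cap value. Only the two GEN3 constants quoted in this comment are used; the helper names `make_gen_cap` and `split_gen_cap` are made up, and the other flag values should be taken from amd_pcie.h rather than guessed.

```python
# Values quoted in the comment above (see amd_pcie.h for the full set).
CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3 = 0x00040000       # platform flag, bits 31:16
CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN3 = 0x00000004  # ASIC flag, bits 15:0


def make_gen_cap(platform_flags, asic_flags):
    """Combine platform flags (upper 16 bits) and ASIC flags (lower 16 bits)."""
    return (platform_flags & 0xFFFF0000) | (asic_flags & 0x0000FFFF)


def split_gen_cap(value):
    """Return (platform_bits, asic_bits), each shifted down to a 16-bit field."""
    return (value >> 16) & 0xFFFF, value & 0xFFFF


# Platform and GPU both restricted to gen3 only.
gen3_only = make_gen_cap(CAIL_PCIE_LINK_SPEED_SUPPORT_GEN3,
                         CAIL_ASIC_PCIE_LINK_SPEED_SUPPORT_GEN3)
print(hex(gen3_only))            # 0x40004
print(split_gen_cap(gen3_only))  # (4, 4)
```

The resulting value would then presumably be passed as the pcie_gen_cap parameter, in the same way as the other amdgpu options mentioned in this thread.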
Sorry, I'm a wee bit confused about what you'd like me to do.
I've just tried booting without the PCIe DPM patch and it now seems to work. I'm not sure if it was a commit or the new firmware - either way this works for me now.
What model laptop is this?
Alienware 15 R2