Bug 100891 - failed to send pre message kernel output delays booting/suspending/resuming
Summary: failed to send pre message kernel output delays booting/suspending/resuming
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
Reported: 2017-04-30 15:40 UTC by MichaelLong
Modified: 2019-11-19 08:15 UTC

Attachments
dmesg output (kernel 4.9) (146.32 KB, text/plain)
2017-04-30 15:40 UTC, MichaelLong
lspci output (3.69 KB, text/plain)
2017-04-30 15:41 UTC, MichaelLong
kern4.10_dmesg (181.17 KB, text/plain)
2017-05-01 08:42 UTC, Edward O'Callaghan
dmesg log 4.12-rc1 (126.06 KB, text/plain)
2017-05-14 15:55 UTC, MichaelLong
dmesg output of kernel 4.17-rc3 (97.88 KB, text/plain)
2018-04-30 15:00 UTC, MichaelLong
recent lspci output (10.59 KB, text/plain)
2018-04-30 15:02 UTC, MichaelLong
dmesg output of kernel 4.18.4 (99.70 KB, text/plain)
2018-08-22 09:37 UTC, MichaelLong

Description MichaelLong 2017-04-30 15:40:24 UTC
Created attachment 131160 [details]
dmesg output (kernel 4.9)

Starting with kernel version 4.9, the boot process, S3 suspend and S3 resume are delayed by kernel messages like the following

[   17.322912] 
                failed to send pre message 148 ret is 0 
[   17.543482] 
                failed to send message 148 ret is 0 

for a very long time, e.g. 10 min (see attachment for a full dmesg log). Once the system is fully booted, everything else seems to be fine.
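
To put a number on the delay, the failure messages can be counted and their time span read from a saved dmesg log. The following is only a rough sketch: the function name `failure_window` is made up for illustration, and it assumes the log keeps the layout quoted above, where the message text is printed on the line after its timestamp.

```python
import re

# Timestamp prefix as printed by dmesg, e.g. "[   17.322912] "
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")
# Matches both failure lines quoted in this report
FAILURE = re.compile(r"failed to send (pre )?message (\w+) ret is (-?\d+)")


def failure_window(dmesg_text):
    """Return (count, first_timestamp, last_timestamp) for the failure lines.

    The message body follows its timestamp on the next line, so each
    failure is attributed to the most recently seen timestamp.
    """
    current_ts = None
    stamps = []
    for line in dmesg_text.splitlines():
        ts = TIMESTAMP.match(line)
        if ts:
            current_ts = float(ts.group(1))
        if FAILURE.search(line) and current_ts is not None:
            stamps.append(current_ts)
    if not stamps:
        return 0, None, None
    return len(stamps), stamps[0], stamps[-1]


sample = """[   17.322912] 
                failed to send pre message 148 ret is 0 
[   17.543482] 
                failed to send message 148 ret is 0 
"""
count, first, last = failure_window(sample)
print(f"{count} failures between {first}s and {last}s")
```

Running it over the full attached log instead of `sample` gives the number of failed sends and the interval they cover.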


I'm not sure if it is related at all, but I did a quick bisection between 4.9-rc2 and 4.9-rc3: with the introduction of commit 3b496626ee8f07919256a4e99cddf42ecd4ba891 I had to supply a new firmware file (topaz_k_smc.bin), otherwise my screen remained black. After supplying the missing firmware file from linux-firmware.git I noticed the long boot delay and the messages.

The 4.11-rcX kernel series reduces the delays drastically, down to around 1-2 minutes, but after a resume from S3 the GPU gets hot very quickly without the fans kicking in, until it reaches some sort of emergency mode with short bursts of jet-engine-like fan noise.

This happens on all kernels from 4.9-rc3 up to the latest 4.11-rc, always tested with the latest set of firmware files.

My GPU-hardware is a Sapphire R9 380X:

05:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Amethyst XT [Radeon R9 M295X Mac Edition] (rev f1)
05:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Device aad8
Comment 1 MichaelLong 2017-04-30 15:41:30 UTC
Created attachment 131161 [details]
lspci output
Comment 2 Edward O'Callaghan 2017-05-01 08:41:44 UTC
Same issue:

$ lspci -vvnn -s 05:00.0
05:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Topaz XT [Radeon R7 M260/M265 / M340/M360 / M440/M445] [1002:6900]
	Subsystem: Lenovo Radeon R7 M260 [17aa:5021]
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 46
	Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
	Region 2: Memory at d0000000 (64-bit, prefetchable) [size=2M]
	Region 4: I/O ports at 3000 [size=256]
	Region 5: Memory at f1600000 (32-bit, non-prefetchable) [size=256K]
	Expansion ROM at f1640000 [disabled] [size=128K]
	Capabilities: <access denied>
	Kernel driver in use: amdgpu
	Kernel modules: amdgpu

$ uname -r
4.10.11-200.fc25.x86_64

$ rpm -q libdrm mesa-dri-drivers
libdrm-2.4.79-1.fc25.x86_64
mesa-dri-drivers-13.0.4-3.fc25.x86_64
Comment 3 Edward O'Callaghan 2017-05-01 08:42:12 UTC
Created attachment 131165 [details]
kern4.10_dmesg
Comment 4 MichaelLong 2017-05-14 15:55:44 UTC
Created attachment 131352 [details]
dmesg log 4.12-rc1

Still broken on 4.12-rc1
Comment 5 MichaelLong 2018-04-30 15:00:00 UTC
Update:

From time to time I test this card with newer kernels along with the most recent set of firmware files. The only difference I was able to observe is that the number of debug messages decreases with every new kernel. With kernel 4.9 so many debug messages were emitted that the monitor was in standby for more than 10 min; now, e.g. with 4.17-rc3, there are only a few messages (see newer dmesg output), hardly noticeable.

Unfortunately the result remains the same: shortly after booting, the GPU temperature rises constantly and the fans do not start to spin. My guess is that eventually the card will overheat or go into some sort of emergency mode. I have always shut down before that point.
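
The overheating symptom can be checked without relying on fan noise by reading the temperature and fan duty cycle from the hwmon sysfs files. This is a minimal sketch, assuming the card exposes the standard hwmon attributes `temp1_input` (millidegrees Celsius) and `pwm1` (0 to 255). The helper names and the 70 °C threshold are hypothetical, and the hwmon directory differs per system.

```python
from pathlib import Path


def read_int(path):
    """Return the integer stored in a sysfs attribute, or None if unavailable."""
    try:
        return int(Path(path).read_text().strip())
    except (OSError, ValueError):
        return None


def thermal_snapshot(hwmon_dir):
    """Read temperature and fan duty cycle from one hwmon directory."""
    hwmon = Path(hwmon_dir)
    temp_milli_c = read_int(hwmon / "temp1_input")  # millidegrees Celsius
    pwm = read_int(hwmon / "pwm1")                  # duty cycle, 0..255
    return {
        "temp_c": None if temp_milli_c is None else temp_milli_c / 1000.0,
        "fan_percent": None if pwm is None else round(pwm * 100 / 255),
    }


def looks_stuck(snapshot, hot_c=70.0):
    """Flag the symptom described above: hot GPU while the fan is not driven.

    hot_c is an arbitrary illustrative threshold, not a vendor limit.
    """
    temp, fan = snapshot["temp_c"], snapshot["fan_percent"]
    return temp is not None and fan == 0 and temp >= hot_c
```

A call such as `looks_stuck(thermal_snapshot("/sys/class/drm/card0/device/hwmon/hwmon1"))` would be the typical use; the path shown is only an example and has to be looked up on the affected machine.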

In the meantime I was also able to test the card in a few other PC systems. Interestingly, the card works there without any of the above-mentioned issues.

The affected mainboard I'm trying to use the card with is an ASUS X99-E WS 3.1 (https://www.asus.com/Commercial-Servers-Workstations/X99E_WSUSB_31/).

Other cards are running fine. Right now I'm using a 'VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Caicos XT [Radeon HD 7470/8470 / R5 235/310 OEM]' without problems.

In the current state this R9 380X is unusable for me. Please let me know if there is any chance to get this fixed. I'm happy to test out any patches if needed.
Comment 6 MichaelLong 2018-04-30 15:00:46 UTC
Created attachment 139229 [details]
dmesg output of kernel 4.17-rc3
Comment 7 MichaelLong 2018-04-30 15:02:59 UTC
Created attachment 139230 [details]
recent lspci output
Comment 8 MichaelLong 2018-08-22 09:37:07 UTC
Created attachment 141239 [details]
dmesg output of kernel 4.18.4

Regular test of the card's working status, this time probably including some more interesting bits (a few excerpts):

[   10.138008] [drm] dce110_link_encoder_construct: Failed to get encoder_cap_info from VBIOS with error code 4!
[   18.912142] [drm:amdgpu_uvd_ring_test_ib [amdgpu]] *ERROR* amdgpu: (0)IB test timed out.
[   18.932321] [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring 12 (-110).
Comment 9 Luke A. Guest 2018-10-01 12:12:29 UTC
I've also been getting these messages on my card:

[435103.682225] amdgpu: [powerplay] 
                 last message was failed ret is 0
[435104.057921] amdgpu: [powerplay] 
                 failed to send message 5e ret is 0 
[435104.811257] amdgpu: [powerplay] 
                 last message was failed ret is 0
[435105.189433] amdgpu: [powerplay] 
                 failed to send message 145 ret is 0 
[435105.941847] amdgpu: [powerplay] 
                 last message was failed ret is 0
[435106.321116] amdgpu: [powerplay] 
                 failed to send message 146 ret is 0 
[435106.702501] amdgpu: [powerplay] 
                 last message was failed ret is 0
[435107.086555] amdgpu: [powerplay] 
                 failed to send message 148 ret is 0 
[435107.842300] amdgpu: [powerplay] 
                 last message was failed ret is 0
[435108.230788] amdgpu: [powerplay] 
                 failed to send message 145 ret is 0 
[435108.999382] amdgpu: [powerplay] 
                 last message was failed ret is 0
[435109.385901] amdgpu: [powerplay] 
                 failed to send message 146 ret is 0 
[445047.844282] amdgpu: [powerplay] 
                 last message was failed ret is 0
[445048.214449] amdgpu: [powerplay] 
                 failed to send message 5e ret is 0 
[445048.953592] amdgpu: [powerplay] 
                 last message was failed ret is 0
[445049.322871] amdgpu: [powerplay] 
                 failed to send message 145 ret is 0 
[445050.061340] amdgpu: [powerplay] 
                 last message was failed ret is 0
[445050.431776] amdgpu: [powerplay] 
                 failed to send message 146 ret is 0 
[445050.801583] amdgpu: [powerplay] 
                 last message was failed ret is 0
[445051.171950] amdgpu: [powerplay] 
                 failed to send message 148 ret is 0 
[445051.912275] amdgpu: [powerplay] 
                 last message was failed ret is 0
[445052.282308] amdgpu: [powerplay] 
                 failed to send message 145 ret is 0 
[445053.023562] amdgpu: [powerplay] 
                 last message was failed ret is 0
[445053.393359] amdgpu: [powerplay] 
                 failed to send message 146 ret is 0 
[445864.746413] amdgpu: [powerplay] 
                 last message was failed ret is 0
[445865.114772] amdgpu: [powerplay] 

Linux rogue 4.19.0-rc4 #1 SMP PREEMPT Sat Sep 22 14:12:36 BST 2018 x86_64 AMD FX(tm)-8350 Eight-Core Processor AuthenticAMD GNU/Linux

01:00.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Tonga PRO [Radeon R9 285/380] (rev f1)
01:00.1 Audio device: Advanced Micro Devices, Inc. [AMD/ATI] Tonga HDMI Audio [Radeon R9 285/380]
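
The log above cycles through the same few message IDs, which is easier to see as a tally. The sketch below is only an illustration: the function name `tally_failed_messages` is invented here, and it assumes the IDs are printed in hexadecimal, which the value `5e` suggests but the log itself does not state.

```python
import re
from collections import Counter

# Captures the message ID from "failed to send [pre ]message <id> ret is ..."
FAILED_SEND = re.compile(r"failed to send (?:pre )?message ([0-9a-fA-F]+) ret is")


def tally_failed_messages(log_text):
    """Count failed sends per message ID, parsing each ID as hexadecimal (assumed)."""
    ids = (int(m.group(1), 16) for m in FAILED_SEND.finditer(log_text))
    return Counter(ids)


# Five entries copied from the log above
excerpt = """[435104.057921] amdgpu: [powerplay] 
                 failed to send message 5e ret is 0 
[435105.189433] amdgpu: [powerplay] 
                 failed to send message 145 ret is 0 
[435106.321116] amdgpu: [powerplay] 
                 failed to send message 146 ret is 0 
[435107.086555] amdgpu: [powerplay] 
                 failed to send message 148 ret is 0 
[435108.230788] amdgpu: [powerplay] 
                 failed to send message 145 ret is 0 
"""

for msg_id, count in sorted(tally_failed_messages(excerpt).items()):
    print(f"0x{msg_id:x}: {count}")
```

Feeding a complete dmesg capture to `tally_failed_messages` shows which IDs fail most often, which may help when comparing logs from different kernels.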
Comment 10 Martin Peres 2019-11-19 08:15:59 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/161.

