Bug 99387

Summary: Kernel 4.9: Kaveri + Hainan choked on boot using amdgpu
Product: DRI
Reporter: Luya Tshimbalanga <luya>
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
QA Contact:
Severity: blocker    
Priority: medium    
Version: DRI git   
Hardware: x86-64 (AMD64)   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description                                          Flags
Journal report                                       none
Journal report update                                none
Journal report with amdgpu.pg_mask=0 on Hainan       none
possible fix                                         none
Backported patch for kernel 4.9                      none
Traceback on boot with patch                         none
Oops 4.9.9                                           none
oops in 4.9.2                                        none
possible fix                                         none
oops with the two patches applied                    none

Description Luya Tshimbalanga 2017-01-12 19:05:06 UTC
Created attachment 128917 [details]
Journal report

Hardware used in test:
ASUS X550ZE powered with R7 M265DX 
Kaveri (R7 M265DX) + Hainan (R5 M230)
running on Fedora 25

Kernel provided from this repository
https://copr.fedorainfracloud.org/coprs/mystro256/amd-staging-kernel/build/498630/

Booted with the kernel parameters "amdgpu.exp_hw_support=1 modprobe.blacklist=radeon".

Result: the kernel crashed with the message "BUG: unable to handle kernel NULL pointer dereference at           (null)"

See the attached journal log for details.
Comment 1 Jeremy Newton 2017-01-12 19:47:15 UTC
FYI, this is built off of Alexander Deucher's freedesktop mirror, amd-staging-4.9 branch, with the 4.9.2 patch set and Fedora's patch set (minus the amd patches that conflict or are already applied in this branch):

https://cgit.freedesktop.org/~agd5f/linux/?h=amd-staging-4.9

I'm assuming you're using the latest Fedora snapshot of linux-firmware; if so, it's a snapshot of this commit:

https://git.kernel.org/cgit/linux/kernel/git/firmware/linux-firmware.git/commit/?id=91ddce492dc0a6a718396e0c79101087134f622d
Comment 2 Luya Tshimbalanga 2017-01-12 22:50:43 UTC
Checking linux-firmware, the system was still on the 20160923 snapshot. Koji has the latest version, so I upgraded to it. I will report whether that fixes the issue.
Comment 3 Alex Deucher 2017-01-12 22:53:32 UTC
You don't need amdgpu.exp_hw_support=1. You just have to make sure SI and CIK support are enabled.
Comment 4 Luya Tshimbalanga 2017-01-13 00:57:57 UTC
Thanks for the pointer Alex. I will remove that parameter and will post the result as soon as possible.
Comment 5 Luya Tshimbalanga 2017-01-13 03:12:11 UTC
Created attachment 128923 [details]
Journal report update

Booted without the "amdgpu.exp_hw_support=1" parameter, with the radeon driver blacklisted and the latest linux-firmware snapshot (20161205).
This generated the traceback seen in the attachment. See this extract:

[drm:uvd_v4_2_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jan 12 18:53:38 kernel: [drm:uvd_v4_2_start [amdgpu]] *ERROR* UVD not responding, giving up!!!
Jan 12 18:53:38 kernel: asus_wmi: Number of fans: 0
Jan 12 18:53:38 kernel: audit: type=1130 audit(1484276016.691:63): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-rfkill comm="systemd" exe="/usr/l
Jan 12 18:53:38 kernel: [drm:amdgpu_uvd_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out.
Jan 12 18:53:38 kernel: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring 11 (-110).
Jan 12 18:53:38 kernel: [drm] ib test on ring 12 succeeded
Jan 12 18:53:38 kernel: [drm:amdgpu_device_init [amdgpu]] *ERROR* ib ring test failed (-110).
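
For longer journals it can help to pull out only the DRM error lines. The following is a minimal sketch, assuming the `[drm:function [module]] *ERROR* message` layout seen in the extract above; the function name `drm_errors` and the sample list are illustrative and not part of any tool. For reference, -110 is -ETIMEDOUT on Linux, which matches the "IB test timed out" message.

```python
import re

# Matches kernel DRM error lines such as:
#   [drm:uvd_v4_2_start [amdgpu]] *ERROR* UVD not responding, giving up!!!
DRM_ERROR = re.compile(
    r"\[drm:(?P<func>\w+) \[(?P<module>\w+)\]\] \*ERROR\* (?P<msg>.*)"
)


def drm_errors(lines):
    """Return (function, module, message) for each DRM *ERROR* line."""
    found = []
    for line in lines:
        match = DRM_ERROR.search(line)
        if match:
            found.append((match["func"], match["module"], match["msg"].strip()))
    return found


# Sample lines copied from the journal extract in this comment.
sample = [
    "Jan 12 18:53:38 kernel: [drm:uvd_v4_2_start [amdgpu]] *ERROR* UVD not responding, giving up!!!",
    "Jan 12 18:53:38 kernel: asus_wmi: Number of fans: 0",
    "Jan 12 18:53:38 kernel: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring 11 (-110).",
    "Jan 12 18:53:38 kernel: [drm] ib test on ring 12 succeeded",
]

for func, module, msg in drm_errors(sample):
    print(f"{module}:{func}: {msg}")
```

On a live system the input could come from `journalctl -k` output split into lines instead of the hard-coded sample.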
Comment 6 Michel Dänzer 2017-01-13 03:59:40 UTC
(In reply to Luya Tshimbalanga from comment #5)
> Jan 12 18:53:38 kernel: [drm:uvd_v4_2_start [amdgpu]] *ERROR* UVD not
> responding, giving up!!!

There are other bug reports about this. You may be able to avoid it with amdgpu.pg_mask=0 and/or amdgpu.cg_mask=0 on the kernel command line.

Your real problem is presumably the oops while initializing the Hainan GPU though, which probably isn't related to the above.
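
To confirm which amdgpu options actually reached the kernel after editing the boot entry, the command line can be inspected. A minimal sketch, assuming options are passed in the usual `module.name=value` form; the helper `module_params` and the example string are hypothetical, and on a running system the string would be read from /proc/cmdline.

```python
def module_params(cmdline, module="amdgpu"):
    """Collect `module.name=value` options from a kernel command line string."""
    params = {}
    prefix = module + "."
    for token in cmdline.split():
        if token.startswith(prefix) and "=" in token:
            name, value = token[len(prefix):].split("=", 1)
            params[name] = value
    return params


# Hypothetical command line for illustration only.
example = (
    "BOOT_IMAGE=/vmlinuz ro quiet modprobe.blacklist=radeon "
    "amdgpu.pg_mask=0 amdgpu.cg_mask=0"
)

print(module_params(example))
print(module_params(example, module="modprobe"))
```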
Comment 7 Luya Tshimbalanga 2017-01-13 06:00:48 UTC
Created attachment 128924 [details]
Journal report with amdgpu.pg_mask=0 on Hainan

> Your real problem is presumably the oops while initializing the Hainan GPU 
> though, which probably isn't related to the above.

It is the Hainan GPU causing the problem on amdgpu, suggesting SI needs more work. Hopefully the information will be enough to provide a solution. Let me know if you need more debugging.
Comment 8 Alex Deucher 2017-01-27 15:34:18 UTC
Created attachment 129182 [details] [review]
possible fix

This patch should fix the crash.
Comment 9 Luya Tshimbalanga 2017-01-28 19:08:51 UTC
(In reply to Alex Deucher from comment #8)
> Created attachment 129182 [details] [review] [review]
> possible fix
> 
> This patch should fix the crash.

Thank you for the fix Alex. I will ask Jerome if he can build the Fedora version of the kernel to test it.
Comment 10 Luya Tshimbalanga 2017-02-04 04:47:19 UTC
Created attachment 129329 [details] [review]
Backported patch for kernel 4.9

Backported the patch to kernel 4.9 for testing purposes. Thanks, Alex.
Comment 11 Luya Tshimbalanga 2017-02-04 04:59:02 UTC
Scratch the last comment. I just got the built kernel with the patch.
Comment 12 Luya Tshimbalanga 2017-02-04 06:10:08 UTC
Created attachment 129330 [details]
Traceback on boot with patch

Sadly, with the patched kernel 4.9 from mystro256 (https://copr.fedorainfracloud.org/coprs/mystro256/gfx-test/build/506611/), a lockup occurred. I was unable to save the log and was forced to do a hard reset. Here is a screenshot of the result.
Comment 13 Marco 2017-02-10 22:20:32 UTC
Created attachment 129495 [details]
Oops 4.9.9

Jumping in, I have a similar hybrid GPU:
01:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Sun PRO [Radeon HD 8570A/8570M] (rev ff)
00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Kabini [Radeon HD 8330] 

Same kernel config:
CONFIG_DRM_AMDGPU=m
CONFIG_DRM_AMDGPU_SI=y
CONFIG_DRM_AMDGPU_CIK=y
CONFIG_DRM_AMDGPU_USERPTR=y

I have now tried 4.9.9, where the patch has landed in stable.
Kernel oops on boot. See the attached pstore log message.
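
A quick way to verify that SI and CIK support are enabled, as requested in comment 3, is to scan the kernel config for the relevant symbols. This is a minimal sketch, assuming the config is available as text (on Fedora it is typically found at /boot/config-<kernel release>, which is an assumption about the installed system); the helper names are illustrative.

```python
REQUIRED = ("CONFIG_DRM_AMDGPU", "CONFIG_DRM_AMDGPU_SI", "CONFIG_DRM_AMDGPU_CIK")


def parse_config(text):
    """Map CONFIG_* symbols to their values, skipping comments and blanks."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("CONFIG_") and "=" in line:
            key, value = line.split("=", 1)
            options[key] = value
    return options


def missing_options(text, required=REQUIRED):
    """Return the required symbols that are not set to y or m."""
    options = parse_config(text)
    return [name for name in required if options.get(name) not in ("y", "m")]


# Config lines quoted in this comment.
sample = """\
CONFIG_DRM_AMDGPU=m
CONFIG_DRM_AMDGPU_SI=y
CONFIG_DRM_AMDGPU_CIK=y
CONFIG_DRM_AMDGPU_USERPTR=y
"""

print(missing_options(sample))
```

An empty list means all three symbols are built in or built as modules.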
Comment 14 Marco 2017-02-10 22:24:31 UTC
Created attachment 129496 [details]
oops in 4.9.2

For the sake of completeness,
find attached the log of the kernel oops on 4.9.2 (before the patch was applied).
Comment 15 Alex Deucher 2017-02-10 23:13:24 UTC
Created attachment 129497 [details] [review]
possible fix

This should fix it (in addition to the previous patch).
Comment 16 Marco 2017-02-11 17:28:25 UTC
Created attachment 129516 [details]
oops with the two patches applied

The patch does not apply cleanly on 4.9.9,
but it's easy to port.

The result is the same, though.
Find attached the oops with the patch applied.

Debugging the faulting instruction (amdgpu_pm_compute_clocks+0x424/0x640 [amdgpu])
led to:

Reading symbols from drivers/gpu/drm/amd/amdgpu/amdgpu_pm.o...done.
(gdb) list *(amdgpu_pm_compute_clocks+0x424/0x640)
0x1a50 is in amdgpu_pm_compute_clocks (drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c:1280).
1275	{
1276		struct drm_device *ddev = adev->ddev;
1277		struct drm_crtc *crtc;
1278		struct amdgpu_crtc *amdgpu_crtc;
1279	
1280		if (!adev->pm.dpm_enabled)
1281			return;
1282	
1283		if (adev->pp_enabled) {
1284			int i = 0;
(gdb) 
1285	
1286			if (adev->mode_info.num_crtc)
1287				amdgpu_display_bandwidth_update(adev);
1288	
1289			for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
1290				struct amdgpu_ring *ring = adev->rings[i];
1291				if (ring && ring->ready)
1292					amdgpu_fence_wait_empty(ring);
1293			}
1294	

Faulty line seems to be:
1280		if (!adev->pm.dpm_enabled)
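
One caveat about the gdb session above: in the oops notation `symbol+0xOFFSET/0xSIZE`, the value after the slash is the function's length, not part of the address. Given the whole string, gdb most likely evaluates `0x424/0x640` as an integer division (which is 0), so the listing starts at the function's opening lines rather than at the faulting instruction. A minimal sketch of converting an oops frame into a `list` command that uses only the offset; the helper name `gdb_list_command` is hypothetical.

```python
import re

# 'name+0xOFFSET/0xSIZE' as printed in kernel oops backtraces.
OOPS_SYMBOL = re.compile(
    r"(?P<name>[\w.]+)\+(?P<offset>0x[0-9a-f]+)/(?P<size>0x[0-9a-f]+)"
)


def gdb_list_command(frame):
    """Turn 'func+0xOFF/0xSIZE' from an oops into a gdb 'list' command."""
    match = OOPS_SYMBOL.search(frame)
    if match is None:
        raise ValueError(f"not a symbol+offset/size frame: {frame!r}")
    offset = int(match["offset"], 16)
    size = int(match["size"], 16)
    if offset >= size:
        raise ValueError("offset lies outside the function")
    # Drop the '/size' part so gdb adds only the offset to the symbol address.
    return f"list *({match['name']}+{offset:#x})"


print(gdb_list_command("amdgpu_pm_compute_clocks+0x424/0x640 [amdgpu]"))
```

Running the resulting command against the same object file should point at the source line for offset 0x424, which may differ from line 1280.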
Comment 17 Luya Tshimbalanga 2017-02-15 01:29:49 UTC
The fix is working. The Hainan video card, aka Jet PRO R5 M230, initialized successfully alongside the Kaveri card:

01:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Jet PRO [Radeon R5 M230]
	Subsystem: ASUSTeK Computer Inc. Device 130d
	Physical Slot: 0
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 41
	Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
	Region 2: Memory at fea00000 (64-bit, non-prefetchable) [size=256K]
	Region 4: I/O ports at e000 [size=256]
	Expansion ROM at fea40000 [disabled] [size=128K]
	Capabilities: [48] Vendor Specific Information: Len=08 <?>
	Capabilities: [50] Power Management version 3
		Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
		DevCap:	MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
			ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
			MaxPayload 256 bytes, MaxReadReq 512 bytes
		DevSta:	CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
		LnkCap:	Port #0, Speed 5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
			ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
		LnkCtl:	ASPM Disabled; RCB 64 bytes Disabled- CommClk+
			ExtSynch- ClockPM- AutWidDis+ BWInt- AutBWInt-
		LnkSta:	Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
		LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
			 Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
			 Compliance De-emphasis: -6dB
		LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
			 EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest+
	Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
		Address: 00000000fee0f00c  Data: 4163
	Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
	Capabilities: [150 v2] Advanced Error Reporting
		UESta:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UEMsk:	DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
		UESvrt:	DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
		CESta:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		CEMsk:	RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
		AERCap:	First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
	Capabilities: [270 v1] #19
	Kernel driver in use: amdgpu
	Kernel modules: radeon, amdgpu

A minor remaining issue is optimal power management. Essentially, the Hainan card is running fine. Hopefully Marco will have similar success.
Comment 18 Marco 2017-02-21 21:39:50 UTC
Works fine here too (but only on 4.10) with the two patches applied.
Comment 19 Luya Tshimbalanga 2017-02-24 18:22:33 UTC
Following up this week with the current patched kernel 4.9.11 from the Mystro256 COPR repository, based on Alex's branch.
The Southern Islands part of the hybrid GPU, i.e. Hainan and its derivative (Sun PRO), runs smoothly without noticeable bugs when applications are started via "Launch with Dedicated Graphic Card" on GNOME Shell 3.22.

Looking at the journalctl report sorted by amdgpu:
kernel: [drm:amdgpu_atombios_dp_link_train [amdgpu]] *ERROR* displayport link status failed
kernel: [drm:amdgpu_atombios_dp_link_train [amdgpu]] *ERROR* clock recovery failed

Not sure if that part is fixed, as it does not appear to affect overall functionality. If all Southern Islands support works fine with the amdgpu module, perhaps it will be time to enable it by default in a future kernel release. If Marco agrees, perhaps this report can be closed as fixed.
Thanks for the hard work.
Comment 20 Luya Tshimbalanga 2017-03-07 04:27:09 UTC
Third week running with the patch. I think this report can be closed.
