Summary: Kernel 4.9: Kaveri + Hainan choked on boot using amdgpu
Product: DRI
Reporter: Luya Tshimbalanga <luya>
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
Severity: blocker
Priority: medium
Version: DRI git
Hardware: x86-64 (AMD64)
OS: Linux (All)

Description
Luya Tshimbalanga
2017-01-12 19:05:06 UTC
FYI, this is built from Alexander Deucher's freedesktop mirror, amd-staging-4.9 branch, with the 4.9.2 patch-set and Fedora's patchset (minus the amd patches that conflict or are already applied in this branch):
https://cgit.freedesktop.org/~agd5f/linux/?h=amd-staging-4.9

I'm assuming you're using the latest Fedora snapshot of linux firmware; if so, it's a snapshot of this commit:
https://git.kernel.org/cgit/linux/kernel/git/firmware/linux-firmware.git/commit/?id=91ddce492dc0a6a718396e0c79101087134f622d

Checking linux-firmware, the system was still on the 20160923 snapshot. Koji has the latest version, so I upgraded to it. I will report if that fixes the issue.

You don't need amdgpu.exp_hw_support=1. You just have to make sure SI and CIK support are enabled.

Thanks for the pointer, Alex. I will remove that parameter and post the result as soon as possible.

Created attachment 128923 [details]
Journal report update
Booted without the "amdgpu.exp_hw_support=1" parameter, with the radeon driver blacklisted and the latest linux-firmware snapshot (20161205).
This generated the traceback seen in the attachment. See this extract:
[drm:uvd_v4_2_start [amdgpu]] *ERROR* UVD not responding, trying to reset the VCPU!!!
Jan 12 18:53:38 kernel: [drm:uvd_v4_2_start [amdgpu]] *ERROR* UVD not responding, giving up!!!
Jan 12 18:53:38 kernel: asus_wmi: Number of fans: 0
Jan 12 18:53:38 kernel: audit: type=1130 audit(1484276016.691:63): pid=1 uid=0 auid=4294967295 ses=4294967295 subj=system_u:system_r:init_t:s0 msg='unit=systemd-rfkill comm="systemd" exe="/usr/l
Jan 12 18:53:38 kernel: [drm:amdgpu_uvd_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out.
Jan 12 18:53:38 kernel: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring 11 (-110).
Jan 12 18:53:38 kernel: [drm] ib test on ring 12 succeeded
Jan 12 18:53:38 kernel: [drm:amdgpu_device_init [amdgpu]] *ERROR* ib ring test failed (-110).
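When a journal dump is long, it can help to pull out only the amdgpu error lines before attaching or quoting them. The following is a minimal sketch, assuming the log lines follow the "[drm:<function> [<module>]] *ERROR* <message>" shape seen in the extract above; the helper name amdgpu_errors and the sample list are illustrative, not part of any existing tool. For reference, the -110 in these messages corresponds to ETIMEDOUT on Linux, which matches the "IB test timed out" line.

```python
import re

# Matches DRM error lines shaped like the ones in the extract above, e.g.
# "[drm:uvd_v4_2_start [amdgpu]] *ERROR* UVD not responding, giving up!!!"
DRM_ERROR = re.compile(
    r"\[drm:(?P<func>\w+) \[(?P<module>\w+)\]\] \*ERROR\* (?P<msg>.*)"
)


def amdgpu_errors(lines):
    """Return (function, message) pairs for every amdgpu *ERROR* line."""
    found = []
    for line in lines:
        match = DRM_ERROR.search(line)
        if match and match.group("module") == "amdgpu":
            found.append((match.group("func"), match.group("msg").strip()))
    return found


# Sample lines copied from the journal extract in this report.
sample = [
    "Jan 12 18:53:38 kernel: [drm:uvd_v4_2_start [amdgpu]] *ERROR* UVD not responding, giving up!!!",
    "Jan 12 18:53:38 kernel: asus_wmi: Number of fans: 0",
    "Jan 12 18:53:38 kernel: [drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on ring 11 (-110).",
    "Jan 12 18:53:38 kernel: [drm] ib test on ring 12 succeeded",
]

for func, msg in amdgpu_errors(sample):
    print(f"{func}: {msg}")
```

The same function could be fed the lines of a saved journal file instead of the hard-coded sample.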
(In reply to Luya Tshimbalanga from comment #5)
> Jan 12 18:53:38 kernel: [drm:uvd_v4_2_start [amdgpu]] *ERROR* UVD not
> responding, giving up!!!

There are other bug reports about this. You may be able to avoid it with amdgpu.pg_mask=0 and/or amdgpu.cg_mask=0 on the kernel command line.

Your real problem is presumably the oops while initializing the Hainan GPU though, which probably isn't related to the above.

Created attachment 128924 [details]
Journal report with amdgpu.pg_mask=0 on Hainan

> Your real problem is presumably the oops while initializing the Hainan GPU
> though, which probably isn't related to the above.

It is the Hainan GPU causing the problem on amdgpu, suggesting SI needs more work. Hopefully the information will be enough to provide a solution. Let me know if you need more debugging.

Created attachment 129182 [details] [review]
possible fix

This patch should fix the crash.

(In reply to Alex Deucher from comment #8)
> Created attachment 129182 [details] [review]
> possible fix
>
> This patch should fix the crash.

Thank you for the fix, Alex. I will ask Jerome if he can build the Fedora version of the kernel to test it.

Created attachment 129329 [details] [review]
Backported patch for kernel 4.9

Backported the patch for kernel 4.9 for testing purposes. Thanks, Alex.

Scratch the last comment. I just got the built kernel with the patch.

Created attachment 129330 [details]
Traceback on boot with patch

Sadly, with the patched kernel 4.9 from mystro256 (https://copr.fedorainfracloud.org/coprs/mystro256/gfx-test/build/506611/), a lockup occurred. I was unable to save the log and was forced to hard reset. Here is a screenshot of the result.

Created attachment 129495 [details]
Oops 4.9.9
Jumping in, I have a similar hybrid GPU:
01:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Sun PRO [Radeon HD 8570A/8570M] (rev ff)
00:01.0 VGA compatible controller: Advanced Micro Devices, Inc. [AMD/ATI] Kabini [Radeon HD 8330]
Same kernel config:
CONFIG_DRM_AMDGPU=m
CONFIG_DRM_AMDGPU_SI=y
CONFIG_DRM_AMDGPU_CIK=y
CONFIG_DRM_AMDGPU_USERPTR=y
I have now tried 4.9.9, which includes the patch that landed in stable.
Kernel oops on boot. See the attached pstore log message.
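Since the thread notes that SI and CIK support must be enabled for amdgpu to drive these parts, a quick check of a kernel config file can rule out a build problem. This is a minimal sketch, assuming the config is plain "KEY=value" text like the listing above; the function names and the REQUIRED tuple are illustrative choices, not an existing utility.

```python
# Options this thread says must be enabled for amdgpu on SI/CIK hardware.
REQUIRED = ("CONFIG_DRM_AMDGPU", "CONFIG_DRM_AMDGPU_SI", "CONFIG_DRM_AMDGPU_CIK")


def parse_kconfig(text):
    """Parse 'KEY=value' lines, skipping blanks and '# ... is not set' comments."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            options[key] = value
    return options


def missing_options(text, required=REQUIRED):
    """Return the required options that are neither built in (y) nor modular (m)."""
    options = parse_kconfig(text)
    return [name for name in required if options.get(name) not in ("y", "m")]


# Config lines copied from the comment above.
sample = """CONFIG_DRM_AMDGPU=m
CONFIG_DRM_AMDGPU_SI=y
CONFIG_DRM_AMDGPU_CIK=y
CONFIG_DRM_AMDGPU_USERPTR=y
"""

print(missing_options(sample))
```

On many distributions the running kernel's config is installed under /boot as config-<kernel release>, though the location is an assumption that should be checked on the system in question.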
Created attachment 129496 [details]
oops in 4.9.2
For the sake of completeness,
find attached the log of the kernel oops on kernel 4.9.2 (before the patch was applied).
Created attachment 129497 [details] [review]
possible fix

This should fix it (in addition to the previous patch).

Created attachment 129516 [details]
oops with the two patches applied
The patch does not apply cleanly on 4.9.9, but it's easy to port.
However, the result is the same.
Find attached the oops with the patch applied.
Debugging the faulting instruction (amdgpu_pm_compute_clocks+0x424/0x640 [amdgpu])
led to:
Reading symbols from drivers/gpu/drm/amd/amdgpu/amdgpu_pm.o...done.
(gdb) list *(amdgpu_pm_compute_clocks+0x424/0x640)
0x1a50 is in amdgpu_pm_compute_clocks (drivers/gpu/drm/amd/amdgpu/amdgpu_pm.c:1280).
1275 {
1276 struct drm_device *ddev = adev->ddev;
1277 struct drm_crtc *crtc;
1278 struct amdgpu_crtc *amdgpu_crtc;
1279
1280 if (!adev->pm.dpm_enabled)
1281 return;
1282
1283 if (adev->pp_enabled) {
1284 int i = 0;
(gdb)
1285
1286 if (adev->mode_info.num_crtc)
1287 amdgpu_display_bandwidth_update(adev);
1288
1289 for (i = 0; i < AMDGPU_MAX_RINGS; i++) {
1290 struct amdgpu_ring *ring = adev->rings[i];
1291 if (ring && ring->ready)
1292 amdgpu_fence_wait_empty(ring);
1293 }
1294
Faulty line seems to be:
1280 if (!adev->pm.dpm_enabled)
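A note on the gdb command used above: in oops output, "symbol+0x424/0x640" means offset 0x424 into a function that is 0x640 bytes long. gdb evaluates the expression with C rules, so "+0x424/0x640" is an integer division that yields 0, and the listing likely resolves to the start of the function rather than the faulting instruction. That may be why line 1280, the first statement, was reported. A minimal sketch of turning the oops symbol into a gdb command that keeps only the offset follows; the helper name gdb_list_command is illustrative.

```python
import re

# Oops symbols look like "function+0xOFFSET/0xSIZE [module]".
SYMBOL = re.compile(
    r"(?P<func>\w+)\+(?P<offset>0x[0-9a-fA-F]+)/(?P<size>0x[0-9a-fA-F]+)"
)


def gdb_list_command(symbol):
    """Build a gdb 'list' command from an oops symbol, dropping the /size part."""
    match = SYMBOL.search(symbol)
    if match is None:
        raise ValueError(f"not an oops-style symbol: {symbol!r}")
    return f"list *({match.group('func')}+{match.group('offset')})"


# Symbol copied from the oops discussed in this comment.
oops_symbol = "amdgpu_pm_compute_clocks+0x424/0x640 [amdgpu]"
print(gdb_list_command(oops_symbol))

# The original expression adds this quotient to the function address: zero.
print(0x424 // 0x640)
```

Running the resulting command against the same amdgpu_pm.o should point at the source line for offset 0x424, assuming the object file matches the kernel that produced the oops.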
The fix is working. The Hainan video card, aka Jet Pro R5 M230, successfully initialized alongside the Kaveri card:

01:00.0 Display controller: Advanced Micro Devices, Inc. [AMD/ATI] Jet PRO [Radeon R5 M230]
    Subsystem: ASUSTeK Computer Inc. Device 130d
    Physical Slot: 0
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 41
    Region 0: Memory at c0000000 (64-bit, prefetchable) [size=256M]
    Region 2: Memory at fea00000 (64-bit, non-prefetchable) [size=256K]
    Region 4: I/O ports at e000 [size=256]
    Expansion ROM at fea40000 [disabled] [size=128K]
    Capabilities: [48] Vendor Specific Information: Len=08 <?>
    Capabilities: [50] Power Management version 3
        Flags: PMEClk- DSI- D1+ D2+ AuxCurrent=0mA PME(D0-,D1+,D2+,D3hot+,D3cold-)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [58] Express (v2) Legacy Endpoint, MSI 00
        DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s <4us, L1 unlimited
            ExtTag+ AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop+
            MaxPayload 256 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr+ UncorrErr- FatalErr- UnsuppReq+ AuxPwr- TransPend-
        LnkCap: Port #0, Speed 5GT/s, Width x8, ASPM L0s L1, Exit Latency L0s <64ns, L1 <1us
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk+
            ExtSynch- ClockPM- AutWidDis+ BWInt- AutBWInt-
        LnkSta: Speed 2.5GT/s, Width x8, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
            Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
            Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
            EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest+
    Capabilities: [a0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: 00000000fee0f00c Data: 4163
    Capabilities: [100 v1] Vendor Specific Information: ID=0001 Rev=1 Len=010 <?>
    Capabilities: [150 v2] Advanced Error Reporting
        UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 00, GenCap+ CGenEn- ChkCap+ ChkEn-
    Capabilities: [270 v1] #19
    Kernel driver in use: amdgpu
    Kernel modules: radeon, amdgpu

The remaining minor issue is optimal power management. Essentially, the Hainan card is running fine. Hopefully Marco will have similar success.

Works fine here too (but only on 4.10) with the two patches applied.

Following up this week with the current patched kernel 4.9.11 from Mystro256's COPR repository, based on Alex's branch. The Southern Islands part of the hybrid GPU, i.e. Hainan and its derivative (Sun PRO), runs smoothly without noticeable bugs when launching applications via "Launch with Dedicated Graphic Card" on Gnome Shell 3.22. Looking at the journalctl report filtered by amdgpu:

kernel: [drm:amdgpu_atombios_dp_link_train [amdgpu]] *ERROR* displayport link status failed
kernel: [drm:amdgpu_atombios_dp_link_train [amdgpu]] *ERROR* clock recovery failed

Not sure if that part is fixed, as it does not appear to affect overall functionality. If all Southern Islands support works fine with the amdgpu module, perhaps it will be time to enable it by default in a future kernel release. If Marco agrees, perhaps this report can be closed as fixed.

Thanks for the hard work.

Third week running with the patch. I think this report can be closed.