Bug 97634

Summary: [amdgpu SI] multigpu setup crashes during boot when dpm=1
Product: DRI
Reporter: Arek Ruśniak <arek.rusi>
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED MOVED
QA Contact:
Severity: normal
Priority: medium
Version: DRI git
Hardware: Other
OS: All
Whiteboard:
i915 platform:
i915 features:
Attachments (description, flags):
kernel log (flags: none)
dmesg log with latest drm-next-4.9-wip: 72bb0f5 (flags: none)
dmesg log with latest drm-next-4.9-wip: 97231a9 (flags: none)
full dmesg for next-drm-4.9-wip: 1c22b05 (flags: none)

Description Arek Ruśniak 2016-09-08 09:27:38 UTC
Created attachment 126300 [details]
kernel log

Tested on drm-next-4.9-wip:
1) 832c6ef + 2 patches from Tom and Michel (from other bugs)
2) 2c0d731
The behavior is the same on both.

With dpm=0 amdgpu doesn't complain and works with intel. 

It's hard to say whether this is a regression, because when I tried DRI_PRIME a few months ago, dpm didn't work at all.
Comment 1 Arek Ruśniak 2016-09-13 22:39:05 UTC
Created attachment 126508 [details]
dmesg log with latest drm-next-4.9-wip: 72bb0f5

modprobe -r amdgpu: at "148 
modprobe amdgpu: at "162

This is some kind of progress, because intel-gfx works, but amdgpu doesn't start at all. The error in dmesg looks the same as in: https://bugs.freedesktop.org/show_bug.cgi?id=97801

kernel: drm-next-4.9-wip: 72bb0f5
Comment 2 Alex Deucher 2016-09-14 13:52:16 UTC
probably a duplicate of bug 97801
Comment 3 Arek Ruśniak 2016-09-14 14:50:35 UTC
Created attachment 126519 [details]
dmesg log with latest drm-next-4.9-wip: 97231a9

Unfortunately it isn't.
Now that bug 97801 is solved, the kernel log looks similar to the first log:

[   90.544035] amdgpu 0000:01:00.0: fb1: amdgpudrmfb frame buffer device
[drm:gfx_v6_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out
[drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on GFX ring (-110).
[drm:amdgpu_device_init [amdgpu]] *ERROR* ib ring test failed (-110).
[drm] Initialized amdgpu 3.6.0 20150101 for 0000:01:00.0 on minor 1
NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
Comment 4 Arek Ruśniak 2016-09-18 16:19:02 UTC
With the new drm-next-4.9-wip and drm-next-4.9, the symptoms are the same as before.
Comment 5 Arek Ruśniak 2016-10-09 08:03:31 UTC
I've just tried the newest drm-next-4.9-wip:
1c22b05623e5e03ada5a767951eac3203b246be9

and there is something new in the kernel log:

[ 3430.379659] amdgpu 0000:01:00.0: fb1: amdgpudrmfb frame buffer device
[ 3431.438851] [drm:gfx_v6_0_ring_test_ib] *ERROR* amdgpu: IB test timed out        
[ 3431.438862] [drm:amdgpu_ib_ring_tests] *ERROR* amdgpu: failed testing IB on GFX ring (-110).
[ 3431.438866] [drm:amdgpu_device_init] *ERROR* ib ring test failed (-110).         
[ 3431.871374] [drm:amdgpu_late_init] *ERROR* late_init of IP block <amdgpu_powerplay> failed -22
[ 3431.871381] amdgpu 0000:01:00.0: amdgpu_late_init failed                         
[ 3431.871386] amdgpu 0000:01:00.0: Fatal error during GPU init                     
[ 3431.871390] [drm] amdgpu: finishing device.

After that, only sysrq (and ssh) still works.
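
For anyone reading these logs: the negative numbers are kernel errno values. On Linux, 110 is ETIMEDOUT and 22 is EINVAL. Below is a minimal sketch, not part of the driver or of any tool mentioned in this thread, that pulls the reporting function and the error code out of `[drm:...] *ERROR*` lines like the ones quoted here. The `decode` helper, the regexes, and the hardcoded errno table are assumptions made for illustration only.

```python
import re

# Hardcoded for the two Linux errno values that appear in this report.
# (Assumption: values as defined by the Linux kernel headers; other OSes differ.)
LINUX_ERRNO = {22: "EINVAL", 110: "ETIMEDOUT"}

# Matches "[drm:func] *ERROR* msg" and "[drm:func [module]] *ERROR* msg".
ERROR_RE = re.compile(
    r"\[drm:(?P<func>\w+)(?: \[\w+\])?\]\s+\*ERROR\*\s+(?P<msg>.*?)\s*$"
)
# Matches a trailing error code such as "(-110)." or "-22".
CODE_RE = re.compile(r"\(?-(\d+)\)?\.?$")


def decode(line):
    """Return (function, errno_value, errno_name) for a drm *ERROR* line.

    Returns None if the line is not a drm *ERROR* line. The errno fields
    are None when the message carries no trailing error code.
    """
    match = ERROR_RE.search(line)
    if match is None:
        return None
    code_match = CODE_RE.search(match.group("msg"))
    if code_match is None:
        return match.group("func"), None, None
    code = int(code_match.group(1))
    return match.group("func"), code, LINUX_ERRNO.get(code, "unknown")


if __name__ == "__main__":
    sample = [
        "[ 3431.438866] [drm:amdgpu_device_init] *ERROR* ib ring test failed (-110).",
        "[ 3431.871374] [drm:amdgpu_late_init] *ERROR* late_init of IP block <amdgpu_powerplay> failed -22",
    ]
    for entry in sample:
        print(decode(entry))
```

Read this way, the IB ring test fails with a timeout, while the later powerplay late_init failure is an invalid-argument error.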
Comment 6 Arek Ruśniak 2016-10-09 08:06:03 UTC
Created attachment 127149 [details]
full dmesg for next-drm-4.9-wip: 1c22b05
Comment 7 Robin KERDILES 2016-11-30 14:46:34 UTC
This is reproducible on kernel 4.9-rc7, even with the drm-fixes-4.9 branch from the agd5f repository merged:
https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-fixes-4.9
The branch was pointing at bcfdd5d5105087e6f33dfeb08a1ca6b2c0287b61 when I merged it.
I can use amdgpu with dpm=0, but performance is bad (not surprising for experimental support on Southern Islands).
I would like to bisect, but as far as I know dpm has never worked on amdgpu SI.
Comment 8 Arek Ruśniak 2017-01-02 21:35:10 UTC
I don't have the HW (multi-GPU) to test this anymore, so feel free to close it at any time and in any way you want.
Comment 9 Robin KERDILES 2017-01-05 04:06:34 UTC
I was still getting issues with amdgpu when I tested Linux 4.9 + a merge from drm-next 4.10, but the dmesg output changed (a different error code).

I will try another merge in the next few days and post the new error codes.
Comment 10 Martin Peres 2019-11-19 08:09:56 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/96.
