Summary: | [amdgpu SI] multigpu setup crashes during boot when dpm=1
---|---
Product: | DRI
Reporter: | Arek Ruśniak <arek.rusi>
Component: | DRM/AMDgpu
Assignee: | Default DRI bug account <dri-devel>
Status: | RESOLVED MOVED
Severity: | normal
Priority: | medium
Version: | DRI git
Hardware: | Other
OS: | All

Created attachment 126508 [details]
dmesg log with latest drm-next-4.9-wip: 72bb0f5

modprobe -r amdgpu: at "148
modprobe amdgpu: at "162

It's some kind of progress, because intel-gfx works, but amdgpu doesn't start at all. The error in dmesg looks the same as in https://bugs.freedesktop.org/show_bug.cgi?id=97801

kernel: drm-next-4.9-wip: 72bb0f5

Created attachment 126519 [details]
dmesg log with latest drm-next-4.9-wip: 97231a9

Unfortunately it still doesn't work now that bug 97801 is solved. The kernel log looks similar to the first log:

[ 90.544035] amdgpu 0000:01:00.0: fb1: amdgpudrmfb frame buffer device
[drm:gfx_v6_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out
[drm:amdgpu_ib_ring_tests [amdgpu]] *ERROR* amdgpu: failed testing IB on GFX ring (-110).
[drm:amdgpu_device_init [amdgpu]] *ERROR* ib ring test failed (-110).
[drm] Initialized amdgpu 3.6.0 20150101 for 0000:01:00.0 on minor 1
NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]

With the new drm-next-4.9-wip and drm-next-4.9 the symptoms are the same as before.

I've just tried the newest drm-next-4.9-wip: 1c22b05623e5e03ada5a767951eac3203b246be9 and there is something new in the kernel log:

[ 3430.379659] amdgpu 0000:01:00.0: fb1: amdgpudrmfb frame buffer device
[ 3431.438851] [drm:gfx_v6_0_ring_test_ib] *ERROR* amdgpu: IB test timed out
[ 3431.438862] [drm:amdgpu_ib_ring_tests] *ERROR* amdgpu: failed testing IB on GFX ring (-110).
[ 3431.438866] [drm:amdgpu_device_init] *ERROR* ib ring test failed (-110).
[ 3431.871374] [drm:amdgpu_late_init] *ERROR* late_init of IP block <amdgpu_powerplay> failed -22
[ 3431.871381] amdgpu 0000:01:00.0: amdgpu_late_init failed
[ 3431.871386] amdgpu 0000:01:00.0: Fatal error during GPU init
[ 3431.871390] [drm] amdgpu: finishing device.

After that, only sysrq works (and ssh).

Created attachment 127149 [details]
full dmesg for next-drm-4.9-wip: 1c22b05
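
For anyone comparing the attached logs, here is a rough sketch of how the DRM `*ERROR*` lines quoted above could be pulled out of a dmesg dump and their return codes translated. The helper name `summarize_drm_errors` and both regular expressions are my own assumptions, not part of any kernel or amdgpu tooling. The errno table only covers the two Linux values that appear in this report (110 = ETIMEDOUT, 22 = EINVAL).

```python
import re

# Linux errno values seen in this report; the kernel returns them negated.
# Only these two are listed; any other code is reported as None.
ERRNO_NAMES = {22: "EINVAL", 110: "ETIMEDOUT"}

# Matches both log styles quoted in this report, for example:
#   [drm:gfx_v6_0_ring_test_ib [amdgpu]] *ERROR* amdgpu: IB test timed out
#   [ 3431.438866] [drm:amdgpu_device_init] *ERROR* ib ring test failed (-110).
ERROR_RE = re.compile(
    r"\[drm:(?P<func>\w+)(?: \[\w+\])?\]\s+\*ERROR\*\s+(?P<msg>.*)"
)

# A trailing negative number, optionally followed by ")" and/or ".".
CODE_RE = re.compile(r"-(\d+)\)?\.?\s*$")


def summarize_drm_errors(log_text):
    """Return (function, message, errno_name_or_None) per DRM *ERROR* line."""
    results = []
    for line in log_text.splitlines():
        match = ERROR_RE.search(line)
        if not match:
            continue  # not a "[drm:func] *ERROR*" line
        msg = match.group("msg").strip()
        code = CODE_RE.search(msg)
        name = ERRNO_NAMES.get(int(code.group(1))) if code else None
        results.append((match.group("func"), msg, name))
    return results


if __name__ == "__main__":
    sample = (
        "[ 3431.438862] [drm:amdgpu_ib_ring_tests] *ERROR* amdgpu: "
        "failed testing IB on GFX ring (-110).\n"
        "[ 3431.871374] [drm:amdgpu_late_init] *ERROR* late_init of IP block "
        "<amdgpu_powerplay> failed -22\n"
    )
    for func, msg, name in summarize_drm_errors(sample):
        print(func, name, msg, sep=" | ")
```

Read this way, the IB ring test fails with a timeout, while the later powerplay `late_init` failure is an invalid-argument error, so the two messages may not share a single cause.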

This is reproducible on kernel 4.9-rc7, even with the drm-fixes-4.9 branch from the agd5f repository merged: https://cgit.freedesktop.org/~agd5f/linux/log/?h=drm-fixes-4.9
The branch was pointing at bcfdd5d5105087e6f33dfeb08a1ca6b2c0287b61 when I merged it.

I can use amdgpu with dpm=0, but performance is bad (not surprising for experimental Southern Islands support).

I would like to bisect, but dpm never worked on amdgpu SI as far as I know.

I don't have the HW (multiGPU) to test this anymore, so feel free to close it at any time and in any way you want.

I was still getting issues with amdgpu when I tested linux 4.9 + a merge from drm-next 4.10, but the dmesg changed (another error code). I will try another merge in the next few days and post the new error codes.

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/96.

Created attachment 126300 [details]
kernel log

Tested on drm-next-4.9-wip:
1) 832c6ef + 2 patches from Tom and Michel (for other bugs)
2) 2c0d731

The behavior is the same on both. With dpm=0 amdgpu doesn't complain and works with intel.

It's hard to say whether this is a regression, because when I tried DRI_PRIME a few months ago dpm didn't work at all.