Bug 97500

Summary: Cannot unbind GPU from AMDGPU
Product: DRI
Reporter: Nick Sarnie <sarnex>
Component: DRM/AMDgpu
Assignee: Default DRI bug account <dri-devel>
Status: RESOLVED FIXED
Severity: normal
Priority: medium
CC: alexdeucher, ckoenig.leichtzumerken, m-bugs-freedesktop, micaelbergeron, notasas, sarnex, vedran
Version: XOrg git
Hardware: x86-64 (AMD64)
OS: Linux (All)
Attachments:
Description                    Flags
dmesg                          none
modprobe then rmmod            none
modprobe then X then rmmod     none
dmesg of powerplay crash       none
dmesg: boot then unbind        none

Description Nick Sarnie 2016-08-26 15:48:06 UTC
Created attachment 126056 [details]
dmesg

Hi guys. With AMDGPU, if I try to unbind my GPU from amdgpu, I get a hard kernel lockup. My GPU is the RX 480. I'm currently on 4.7.2, but I've tried drm-next-4.9 also. I've tried both echo $CARD_PCI_LOCATION > unbind, and unbinding the vtcon and then modprobe -r amdgpu. When I had my HD 7950, the echo $CARD_PCI_LOCATION > unbind method worked every time. I'm not sure how to get any debug info, since even an ssh session locks up. I tried using pstore, but no logs were saved.
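
For readers reproducing this, the two approaches described above can be sketched as small shell functions. This is only a sketch under assumptions: the function names and the SYSFS_ROOT override are hypothetical (added so the functions can be dry-run against a scratch directory), and the frame buffer console detection follows the "frame buffer" name match used in the kernel's fbcon documentation.

```shell
# Sketch only. SYSFS_ROOT is a hypothetical override for dry runs against a
# scratch directory; on a real system it is unset and /sys is used.

# Unbind every virtual console backed by the frame buffer, so that fbcon no
# longer holds the driver before "modprobe -r amdgpu" is attempted.
release_fb_consoles() (
    root="${SYSFS_ROOT:-/sys}"
    for con in "$root"/class/vtconsole/vtcon*; do
        [ -f "$con/name" ] || continue
        if grep -q "frame buffer" "$con/name"; then
            printf '0\n' > "$con/bind"
        fi
    done
)

# Ask the PCI core to detach one device (full address, e.g. 0000:01:00.0)
# from the amdgpu driver.
unbind_amdgpu() (
    dev="$1"
    root="${SYSFS_ROOT:-/sys}"
    unbind="$root/bus/pci/drivers/amdgpu/unbind"

    if [ -z "$dev" ]; then
        echo "usage: unbind_amdgpu <pci-address>" >&2
        exit 2
    fi
    if [ ! -w "$unbind" ]; then
        echo "cannot write $unbind (amdgpu not loaded, or not root?)" >&2
        exit 1
    fi
    printf '%s\n' "$dev" > "$unbind"
)
```

As root, the first method corresponds to unbind_amdgpu "$CARD_PCI_LOCATION", and the second to release_fb_consoles followed by modprobe -r amdgpu. On the kernels discussed in this report, either path can lock up the machine.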

Please let me know if you have any ideas or need any more info.

Thanks,
Sarnex
Comment 1 Grazvydas Ignotas 2016-09-01 00:17:47 UTC
Same problem here with an RX 470. This happens both when rmmod'ing amdgpu and when attempting to unbind, even when using the drm-next-4.9-wip branch.

It would be great if this worked, so that one could switch to GPU passthrough without a reboot. The Windows driver already seems to behave well: I can start and shut down the VM multiple times and then hand the card over to amdgpu without problems. Only taking it away from amdgpu locks up the machine.
Comment 2 Michel Dänzer 2016-09-01 00:47:31 UTC
I get bug updates via the mailing list.
Comment 3 Micael Bergeron 2016-09-13 01:03:41 UTC
I also see this behavior on Linux 4.7.2.
I tried both unbinding via /sys/bus/pci/drivers/amdgpu/unbind and removing the device and then triggering a rescan via /sys/bus/pci/devices/.../remove and /sys/bus/pci/rescan.

I get kernel panics and/or system hangs either way.
It would be awesome to be able to yield the GPU to a VM then claim it back when finished.
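
The remove-and-rescan sequence mentioned in this comment can be sketched as follows. The function names and the SYSFS_ROOT override are hypothetical and exist only so the sketch can be exercised against a scratch directory; the device address is passed in by the caller rather than guessed.

```shell
# Sketch only. SYSFS_ROOT is a hypothetical override for dry runs; on a real
# system it is unset and /sys is used.

# Hot-remove one PCI device (full address, e.g. 0000:01:00.0) from the bus.
pci_remove() (
    dev="$1"
    root="${SYSFS_ROOT:-/sys}"
    remove="$root/bus/pci/devices/$dev/remove"

    if [ -z "$dev" ]; then
        echo "usage: pci_remove <pci-address>" >&2
        exit 2
    fi
    if [ ! -w "$remove" ]; then
        echo "cannot write $remove (no such device, or not root?)" >&2
        exit 1
    fi
    printf '1\n' > "$remove"
)

# Ask the PCI core to re-enumerate the bus, so that removed devices are
# rediscovered and drivers get a chance to bind again.
pci_rescan() (
    root="${SYSFS_ROOT:-/sys}"
    rescan="$root/bus/pci/rescan"

    if [ ! -w "$rescan" ]; then
        echo "cannot write $rescan (not root?)" >&2
        exit 1
    fi
    printf '1\n' > "$rescan"
)
```

The intended use is pci_remove with the card's address, then the VM session, then pci_rescan to claim the card back. On the kernels discussed here, the first step is where the panic or hang occurs.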
Comment 4 Grazvydas Ignotas 2016-09-24 21:23:26 UTC
Created attachment 126768 [details]
modprobe then rmmod

I've been trying today's drm-next-4.9-wip (merged with 4.8.0-rc7) and the situation has improved somewhat: doing rmmod just after modprobe succeeds with a WARN from TTM, but attempts to modprobe it again fail. If an X session is started and stopped before rmmod, the consequences are more severe; it looks like some sort of corruption.
Comment 5 Grazvydas Ignotas 2016-09-24 21:25:33 UTC
Created attachment 126769 [details]
modprobe then X then rmmod

dmesg when an X session was used before rmmod
Comment 6 Grazvydas Ignotas 2016-09-25 21:06:31 UTC
Created attachment 126782 [details]
dmesg of powerplay crash

I've sent some patches with fixes, but there seem to be multiple other issues.

One of the problems is that struct amdgpu_i2c_chan contains struct drm_dp_aux, and when amdgpu_i2c_fini() is called and frees amdgpu_i2c_chan, the embedded drm_dp_aux is still in use. This causes memory corruption (a use-after-free). I don't know how to solve this; perhaps somebody knows this code better?
A hack can be used to trade this corruption for a leak:

diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_i2c.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_i2c.c
index 34bab61..8beaee0 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_i2c.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_i2c.c
@@ -221,6 +221,8 @@ void amdgpu_i2c_destroy(struct amdgpu_i2c_chan *i2c)
        if (!i2c)
                return;
        i2c_del_adapter(&i2c->adapter);
+       if (i2c->has_aux)
+               return;
        kfree(i2c);
 }
 
---
Another one is a TTM leak, which can also be seen in this attachment.
CONFIG_DMA_API_DEBUG reports:

WARNING: CPU: 3 PID: 1666 at lib/dma-debug.c:976 dma_debug_device_change+0x1ca/0x240
pci 0000:01:00.0: DMA-API: device driver has pending DMA allocations while released from device [count=202]
One of leaked entries details: [device address=0x00000003dcfe9000] [size=4096 bytes] [mapped with DMA_BIDIRECTIONAL] [mapped as coherent]

Mapped at:
 [<ffffffff8163d941>] debug_dma_alloc_coherent+0x41/0x110
 [<ffffffffa0728d84>] ttm_dma_populate+0xb64/0x1150 [ttm]
 [<ffffffffa0b770ac>] amdgpu_ttm_tt_populate+0x35c/0x510 [amdgpu]
 [<ffffffffa0719141>] ttm_tt_bind+0x71/0xd0 [ttm]
 [<ffffffffa071c9d8>] ttm_bo_handle_move_mem+0xa08/0xaa0 [ttm]

---
The next one is a powerplay crash at drivers/gpu/drm/amd/powerplay/hwmgr/smu7_hwmgr.c:3336: dpm_table->sclk_table.count is 0, so the array access there is invalid. It could be related to the "DPM is already running right now, no need to enable DPM!" message; the full dmesg is attached.

I won't have time to work on this for a while, but maybe somebody else does.
Comment 7 Andreas Grosse 2016-10-04 15:42:00 UTC
Created attachment 126997 [details]
dmesg: boot then unbind

I am getting a kernel panic with Linux 4.8.0 when I unbind my RX480 (XFX Radeon RX 480 GTR Black 8GB, if that helps) from amdgpu. The system freezes immediately and only pushes this to the serial console (which is why it is not in the attached dmesg):

[   80.266963] {1}[Hardware Error]: event severity: fatal
[   80.266964] {1}[Hardware Error]:  Error 0, type: fatal
[   80.266964] {1}[Hardware Error]:   section_type: PCIe error
[   80.266964] {1}[Hardware Error]:   port_type: 4, root port
[   80.266965] {1}[Hardware Error]:   version: 1.16
[   80.266965] {1}[Hardware Error]:   command: 0x4010, status: 0x0547
[   80.266965] {1}[Hardware Error]:   device_id: 0000:00:01.0
[   80.266966] {1}[Hardware Error]:   slot: 0
[   80.266966] {1}[Hardware Error]:   secondary_bus: 0x01
[   80.266966] {1}[Hardware Error]:   vendor_id: 0x8086, device_id: 0x0c01
[   80.266967] {1}[Hardware Error]:   class_code: 000406
[   80.266967] {1}[Hardware Error]:   bridge: secondary_status: 0x2000, control: 0x0003
[   80.266967] Kernel panic - not syncing: Fatal hardware error!
[   80.705884] Kernel Offset: disabled

Is this the same issue?

I have attached the kernel output from boot until I got the panic after unbinding with this command:
echo 0000:01:00.0 > /sys/bus/pci/drivers/amdgpu/unbind
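
The address written above (0000:01:00.0) is specific to this machine. As a hedged sketch, the devices currently bound to amdgpu can be listed from the driver's sysfs directory, which holds one entry per bound device named after its PCI address. The function name and the SYSFS_ROOT override are hypothetical and only there for dry runs.

```shell
# Sketch only. SYSFS_ROOT is a hypothetical override for dry runs; on a real
# system it is unset and /sys is used.

# Print the PCI address of every device currently bound to amdgpu, one per
# line. Entries such as "bind", "unbind" and "module" do not match the
# domain:bus:device.function pattern and are skipped.
list_amdgpu_devices() (
    root="${SYSFS_ROOT:-/sys}"
    for link in "$root"/bus/pci/drivers/amdgpu/????:??:??.?; do
        # An unmatched glob stays literal; skip it.
        [ -e "$link" ] || [ -L "$link" ] || continue
        basename "$link"
    done
)
```

Each printed address is a candidate for the echo command above.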

Is there any further information that I can provide to address this issue?
Comment 8 Grazvydas Ignotas 2016-10-29 15:06:40 UTC
Finally it's behaving properly for me when using today's drm-next-4.10-wip branch with this patch on top:
https://lists.freedesktop.org/archives/amd-gfx/2016-October/003141.html
Comment 9 Nick Sarnie 2016-10-29 16:37:52 UTC
(In reply to Grazvydas Ignotas from comment #8)
> Finally it's behaving properly for me when using today's drm-next-4.10-wip
> branch with this patch on top:
> https://lists.freedesktop.org/archives/amd-gfx/2016-October/003141.html

I can confirm that GPU unbinding works as expected with this setup. I'm getting constant GPU hangs and weird behavior when using my Intel GPU, but I can't imagine it's related to these changes.

Great work,
Sarnex
Comment 10 Christian König 2016-10-31 07:44:38 UTC
I hoped that this fix might help with this bug as well, but I couldn't find the bug report again offhand.

Good to see that fixed as well. Please close the bug report as soon as you can confirm that it works on an upstream kernel.
