Bug 111229 - Unable to unbind GPU from amdgpu
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
Reported: 2019-07-27 04:14 UTC by wedens13
Modified: 2019-11-19 09:37 UTC

Attachments
dmesg kernel 5.2.1 (218.15 KB, text/plain)
2019-07-27 04:14 UTC, wedens13
dmesg kernel 4.19.60 (225.00 KB, text/plain)
2019-07-27 04:14 UTC, wedens13
lspci -vvv before unbind (7.05 KB, text/plain)
2019-07-27 04:14 UTC, wedens13
lspci -vvv after unbind (7.02 KB, text/plain)
2019-07-27 04:15 UTC, wedens13
unbinding without X running (169.43 KB, text/plain)
2019-07-28 11:35 UTC, wedens13
kernel 5.1 (157.84 KB, text/plain)
2019-07-29 13:09 UTC, wedens13
another kernel, another disastrous unbind attempt (3.22 KB, text/plain)
2019-08-06 00:14 UTC, Eugene Shatsky

Description wedens13 2019-07-27 04:14:03 UTC
Created attachment 144877
dmesg kernel 5.2.1

Arch Linux
Kernel version: 5.2.1

I have two GPUs in my system: an integrated Intel GPU and a Sapphire Pulse Vega 56.
I boot with Intel as my primary GPU and use the Vega for VFIO (GPU passthrough) and GPU offloading.
What I'm trying to do is boot with the amdgpu driver bound to the Vega and rebind it to vfio-pci when I start a VM (qemu).

The problem occurs when I try to unbind the Vega from the amdgpu driver using this command:
echo -n "0000:03:00.0" > /sys/bus/pci/drivers/amdgpu/unbind

This results in a segfault with the following error in dmesg (the full dmesg from boot to shutdown is attached):
[drm:amdgpu_pci_remove [amdgpu]] *ERROR* Device removal is currently not supported outside of fbcon

After that I'm unable to rebind the device to amdgpu or any other driver:
echo "0000:03:00.0" > /sys/bus/pci/drivers/amdgpu/bind
bash: echo: write error: No such device
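
When a bind or unbind fails like this, it helps to record which driver, if any, still holds the device. A minimal sketch, assuming the standard sysfs layout where a bound device has a "driver" symlink; the function name and the SYSFS_ROOT override are hypothetical and exist only so the check can be tried against a scratch directory:

```shell
# Print the name of the driver bound to a PCI device, or "none".
# SYSFS_ROOT is a hypothetical override (not a kernel feature); it
# defaults to the real /sys.
pci_current_driver() {
    dev_dir="${SYSFS_ROOT:-/sys}/bus/pci/devices/$1"
    if [ -L "$dev_dir/driver" ]; then
        # "driver" is a symlink to the bound driver's sysfs directory.
        basename "$(readlink "$dev_dir/driver")"
    else
        echo none
    fi
}

# Example call: pci_current_driver 0000:03:00.0
```

Running it before and after the unbind gives a compact record to go alongside the lspci output.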

I'm also unable to shut down properly. The shutdown process gets stuck at some point and only holding the power button helps.

I've attached relevant lspci -vvv output before and after attempt to unbind, in case it's useful.

I've also tried unbinding on kernel 4.19.60; there the command just hangs. I've attached the log of this attempt (the error is different from the one on 5.2.1).
Comment 1 wedens13 2019-07-27 04:14:33 UTC
Created attachment 144878
dmesg kernel 4.19.60
Comment 2 wedens13 2019-07-27 04:14:55 UTC
Created attachment 144879
lspci -vvv before unbind
Comment 3 wedens13 2019-07-27 04:15:13 UTC
Created attachment 144880
lspci -vvv after unbind
Comment 4 wedens13 2019-07-27 05:49:34 UTC
My first guess is that unbinding causes a GPU reset, which is known to leave the GPU in a messy state ("the reset bug").
Comment 5 wedens13 2019-07-28 11:35:35 UTC
Created attachment 144896
unbinding without X running

I've attached a log of an attempt to unbind without X running:

systemctl stop sddm
echo 0 > /sys/class/vtconsole/vtcon0/bind
echo 0 > /sys/class/vtconsole/vtcon1/bind
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind || true

echo "0000:03:00.0" > /sys/bus/pci/devices/0000:03:00.0/driver/unbind

The result is the same, but the backtrace seems a bit different. This was done with kernel 5.2.1.

I've tried suspend-to-RAM and another reset bug mitigation (which helps in other cases), but the GPU is still unusable after this failed unbind attempt. I still can't rebind it to amdgpu or vfio-pci, and a clean shutdown is not possible.
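
The sequence above can be wrapped in one function so that it is repeated identically between attempts. This is only a sketch of the same steps: the function name and the SYSFS_ROOT override are hypothetical, and stopping the display manager (systemctl stop sddm) is left as a separate manual step.

```shell
# Sketch: release console/framebuffer users, then unbind a PCI device.
# SYSFS_ROOT is a hypothetical override for rehearsing against a scratch
# directory; by default the function operates on the real /sys.
release_and_unbind() {
    dev="$1"
    root="${SYSFS_ROOT:-/sys}"

    # Detach every virtual console driver (vtcon0, vtcon1, ...).
    for vt in "$root"/class/vtconsole/vtcon*/bind; do
        if [ -e "$vt" ]; then
            echo 0 > "$vt"
        fi
    done

    # Unbind the EFI framebuffer if present; failure here is tolerated,
    # matching the "|| true" in the commands above.
    efifb="$root/bus/platform/drivers/efi-framebuffer/unbind"
    if [ -e "$efifb" ]; then
        echo efi-framebuffer.0 > "$efifb" || true
    fi

    # Unbind the device from whichever driver currently holds it.
    unbind="$root/bus/pci/devices/$dev/driver/unbind"
    if [ ! -e "$unbind" ]; then
        echo "$dev: no driver bound" >&2
        return 1
    fi
    echo "$dev" > "$unbind"
}

# Example call (as root): release_and_unbind 0000:03:00.0
```

Looping over vtcon* instead of naming vtcon0 and vtcon1 is a design choice of this sketch, so it also covers systems with a different number of console drivers.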
Comment 6 wedens13 2019-07-28 18:38:35 UTC
Seems to be a regression. 

I can unbind from amdgpu and bind to vfio-pci just fine on kernel 4.19.60-1-lts.

I was able to unbind without the previous error after running:

echo 0 > /sys/class/vtconsole/vtcon0/bind
echo 0 > /sys/class/vtconsole/vtcon1/bind
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind || true
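
The commands for binding to vfio-pci are not shown here. One common way to do it, sketched under assumptions, is the PCI driver_override attribute followed by a write to drivers_probe. The function name and the SYSFS_ROOT override are hypothetical, and the sketch assumes the vfio-pci module is already loaded and the device is currently unbound.

```shell
# Sketch: steer a PCI device to a specific driver via driver_override.
# SYSFS_ROOT is a hypothetical override for dry runs; default is /sys.
bind_with_override() {
    dev="$1"
    drv="$2"
    root="${SYSFS_ROOT:-/sys}"
    dev_dir="$root/bus/pci/devices/$dev"

    if [ ! -d "$dev_dir" ]; then
        echo "$dev: no such PCI device" >&2
        return 1
    fi

    # Only the named driver may claim the device from now on.
    echo "$drv" > "$dev_dir/driver_override" || return 1

    # Ask the PCI core to re-run driver matching for this device.
    echo "$dev" > "$root/bus/pci/drivers_probe"
}

# Example call (as root): bind_with_override 0000:03:00.0 vfio-pci
```

To hand the card back to amdgpu later, the override has to be cleared first (echo > driver_override) or set to amdgpu before probing again.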
Comment 7 wedens13 2019-07-29 13:09:22 UTC
Created attachment 144907
kernel 5.1

I've narrowed it down to kernel 5.1. There are a lot of amdgpu changes in 5.1 (Vega-related changes specifically).

I hope someone more knowledgeable in amdgpu will be able to find what exactly in 5.1 breaks unbinding. Let me know if I can help.
Comment 8 Eugene Shatsky 2019-08-06 00:14:17 UTC
Created attachment 144952
another kernel, another disastrous unbind attempt

I haven't been able to rebind my RX 470 or shut the system down cleanly after unbinding it on any kernel my NixOS has had since I got the card last winter. I reproduced OP's method on 4.19.64 and got severe warnings and an oops; "modprobe -r amdgpu" just hangs.
Comment 9 wedens13 2019-09-03 19:06:16 UTC
I'll do more testing, but it seems that unbind works with kernel 5.3-rc7.

There is still this error in the log:
[drm:amdgpu_pci_remove [amdgpu]] *ERROR* Device removal is currently not supported outside of fbcon
but without any backtraces, and unbind seems to succeed both with and without X running (on the other GPU, of course).

It'd be nice to have confirmation from other people.

Note that to bind the GPU to vfio-pci, the reset app must be used after unbinding from amdgpu: https://forum.level1techs.com/t/vega-10-and-12-reset-application/145666
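
For anyone confirming on other hardware, a compact summary of a saved dmesg makes reports easier to compare. A minimal sketch; the function name and output format are made up for illustration. It counts occurrences of the removal error and of "Call Trace:" lines in a log file:

```shell
# Sketch: summarize a saved kernel log after an unbind attempt.
summarize_unbind_log() {
    log="$1"
    # grep -c prints 0 but exits non-zero when nothing matches,
    # hence the "|| true".
    removal=$(grep -cF 'Device removal is currently not supported outside of fbcon' "$log" || true)
    traces=$(grep -cF 'Call Trace:' "$log" || true)
    echo "removal_errors=$removal call_traces=$traces"
}

# Example: dmesg > unbind.log; summarize_unbind_log unbind.log
```

A result with a removal error but zero call traces would match what I see on 5.3-rc7.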
Comment 10 Eugene Shatsky 2019-10-05 22:13:02 UTC
I confirm that on 5.3-rc7 I could unbind/bind the RX 470 multiple times and shut the system down cleanly afterwards. I got a warning with a trace in dmesg; I'm now going to check whether it affects system stability and whether my goal of switching the Radeon-powered seat between a Linux desktop (without a persistent session, of course) and a virtual machine is now reachable.
Comment 11 Eugene Shatsky 2019-10-21 07:22:46 UTC
Since my last comment I've used this a dozen times to switch between the Linux desktop and a Windows VM. Once amdgpu crashed after resume from suspend, but I'm not sure it was related to this bug, and I was still able to reboot afterwards.
However, I still sometimes get this warning on unbind:

WARNING: CPU: 0 PID: 1109 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:929 amdgpu_bo_unpin+0xc8/0xf0 [amdgpu]
Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio fuse amdgpu amd_iommu_v2 gpu_sched ttm xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_rejec>
 nf_conntrack nf_defrag_ipv4 libcrc32c zsmalloc ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_comm>
CPU: 0 PID: 1109 Comm: .libvirtd-wrapp Tainted: G           O      5.3.0-rc7 #1-NixOS
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H61M-DGS R2.0, BIOS P1.10 10/01/2013
RIP: 0010:amdgpu_bo_unpin+0xc8/0xf0 [amdgpu]
Code: ff 48 83 c0 0c 48 39 d0 75 ea 48 8d 73 30 48 8d 7b 50 48 8d 54 24 08 e8 46 1f d8 ff 85 c0 74 a1 e9 30 6c 21 00 e8 28 f9 6b f5 <0f> 0b 48 8b >
RSP: 0018:ffffa4df00a4bd28 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8c60449a4800 RCX: 0000000000000002
RDX: ffff8c60423c9b00 RSI: 0000000000000000 RDI: ffff8c60449a4800
RBP: ffff8c6008fa4058 R08: 0000000000000000 R09: ffffffffc0b3c000
R10: ffff8c60449a2800 R11: 0000000000000001 R12: ffff8c6008fa6378
R13: ffff8c6008fa6370 R14: ffff8c6008fa4058 R15: ffff8c6008d7f260
FS:  00007fac9a81f700(0000) GS:ffff8c605f400000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffea51ccff8 CR3: 00000004048c4003 CR4: 00000000001606f0
Call Trace:
 amdgpu_bo_free_kernel+0x6b/0x120 [amdgpu]
 amdgpu_gfx_rlc_fini+0x47/0x70 [amdgpu]
 gfx_v8_0_sw_fini+0xa1/0x1a0 [amdgpu]
 amdgpu_device_fini+0x257/0x479 [amdgpu]
 amdgpu_driver_unload_kms+0x4a/0x90 [amdgpu]
 drm_dev_unregister+0x4b/0xb0 [drm]
 amdgpu_pci_remove+0x25/0x50 [amdgpu]
 pci_device_remove+0x3b/0xc0
 device_release_driver_internal+0xd8/0x1b0
 unbind_store+0x94/0x120
 kernfs_fop_write+0x108/0x190
 vfs_write+0xa5/0x1a0
 ksys_write+0x59/0xd0
 do_syscall_64+0x4e/0x120
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7faca4a7b36f
Code: 1f 40 00 41 54 55 49 89 d4 53 48 89 f5 89 fb 48 83 ec 10 e8 53 fd ff ff 4c 89 e2 41 89 c0 48 89 ee 89 df b8 01 00 00 00 0f 05 <48> 3d 00 f0 >
RSP: 002b:00007fac9a81e4d0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000012 RCX: 00007faca4a7b36f
RDX: 000000000000000c RSI: 00007fac84019a20 RDI: 0000000000000012
RBP: 00007fac84019a20 R08: 0000000000000000 R09: 000000000000002f
R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000000c
R13: 0000000000000000 R14: 0000000000000012 R15: 00007fac9a81e568
---[ end trace ffd153eee3d00ec4 ]---
amdgpu 0000:01:00.0: 00000000001146cc unpin not necessary

It's produced by https://github.com/torvalds/linux/blob/574cc4539762561d96b456dbc0544d8898bd4c6e/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c#L937 . I wonder if the buffer object pin count is something like a reference count.

Also it looks like the message

*ERROR* Device removal is currently not supported outside of fbcon

is printed unconditionally, without checking whether DRM nodes are being used by userspace clients. I wonder if it's possible to implement such a check and prevent the unbind if they are.
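
Whether the driver could do this is an open question. From userspace, a rough approximation is to look for processes that hold /dev/dri nodes open before attempting the unbind. A minimal sketch, with a hypothetical function name and a hypothetical PROC_ROOT override for testing; it reports users of any DRM node rather than only those of the card being unbound, and it needs root to see other users' file descriptors:

```shell
# Sketch: list PIDs that have any /dev/dri/* node open.
# PROC_ROOT is a hypothetical override; it defaults to the real /proc.
drm_node_users() {
    proc="${PROC_ROOT:-/proc}"
    for fd in "$proc"/[0-9]*/fd/*; do
        [ -L "$fd" ] || continue
        case "$(readlink "$fd")" in
            /dev/dri/*)
                # Reduce "<proc>/<pid>/fd/<n>" to "<pid>".
                pid=${fd#"$proc"/}
                echo "${pid%%/*}"
                ;;
        esac
    done | sort -u
}

# Example: refuse to unbind while the list is non-empty.
# [ -z "$(drm_node_users)" ] || echo "DRM nodes still in use" >&2
```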
Comment 12 Andrew B 2019-11-04 04:08:52 UTC
Fedora 31, 5.3.1 kernel, 5700XT - still seeing problems with unbinding from the AMDGPU driver.  

I have video=efifb:off in my kernel parameters to keep the efifb from ever using the card.

After stopping X and unbinding vtcon0 and vtcon1, attempting to unbind the card from the driver yields the following error. I cannot bind a new driver to the card, and I can't shut down cleanly.

[  140.760872] fbcon: Taking over console
[  140.773454] Console: switching to colour frame buffer device 320x90
[  577.562635] Console: switching to colour dummy device 80x25
[  679.403956] VFIO - User Level meta-driver version: 0.3
[  679.410718] [drm:amdgpu_pci_remove [amdgpu]] *ERROR* Device removal is currently not supported outside of fbcon
[  679.410938] [drm] amdgpu: finishing device.
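
Since this setup depends on video=efifb:off, it may be worth confirming that the parameter actually reached the running kernel. A minimal sketch; the function name and the CMDLINE_FILE override are hypothetical, and it only matches a parameter that appears as a whole space-separated word:

```shell
# Sketch: succeed if the kernel command line contains the exact parameter.
# CMDLINE_FILE is a hypothetical override; default is /proc/cmdline.
has_kernel_param() {
    cmdline="${CMDLINE_FILE:-/proc/cmdline}"
    # One parameter per line, then an exact fixed-string line match.
    tr ' ' '\n' < "$cmdline" | grep -qxF -- "$1"
}

# Example: has_kernel_param video=efifb:off && echo "efifb disabled"
```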
Comment 13 Andrew B 2019-11-04 04:09:34 UTC
My comment above should have referenced 5.3.7 as the kernel version.
Comment 14 wedens13 2019-11-05 10:28:22 UTC
(In reply to Andrew B from comment #13)
> My comment above should have reference 5.3.7 as the kernel version.

For Navi you can try this kernel patch:
https://forum.level1techs.com/t/navi-reset-kernel-patch/147547
Comment 15 Martin Peres 2019-11-19 09:37:21 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug via this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/878.

