Created attachment 144877 [details]
dmesg kernel 5.2.1

Arch Linux, kernel version: 5.2.1

I have two GPUs in my system: an integrated Intel and a Sapphire Pulse Vega 56. I boot with Intel as my primary GPU and use the Vega for VFIO (GPU passthrough) and GPU offloading. What I'm trying to do is boot with the amdgpu driver bound to the Vega and rebind it to vfio-pci when I start the VM (qemu).

The problem occurs when I try to unbind the Vega from the amdgpu driver with this command:

echo -n "0000:03:00.0" > /sys/bus/pci/drivers/amdgpu/unbind

It results in a segfault, with the following error in dmesg (full dmesg from boot to shutdown is attached):

[drm:amdgpu_pci_remove [amdgpu]] *ERROR* Device removal is currently not supported outside of fbcon

After that I'm unable to rebind the device to amdgpu or any other driver:

echo "0000:03:00.0" > /sys/bus/pci/drivers/amdgpu/bind
bash: echo: write error: No such device

I'm also unable to shut down properly: the shutdown process gets stuck at some point and only holding the power button helps. I've attached the relevant lspci -vvv output from before and after the unbind attempt, in case it's useful.

Another thing I've tried is unbinding with kernel 4.19.60, which just hangs after executing the command. I've attached the log of this attempt (the error is different from 5.2.1).
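For reference, a quick way to check which driver currently owns the card before and after the unbind attempt (a minimal sketch; the PCI address 0000:03:00.0 is just the one from this report, adjust for your system):

# show the device together with the driver in use ("Kernel driver in use: ...")
lspci -nnk -s 0000:03:00.0

# same check via sysfs; prints nothing if no driver is currently bound
readlink /sys/bus/pci/devices/0000:03:00.0/driver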
Created attachment 144878 [details] dmesg kernel 4.19.60
Created attachment 144879 [details] lspci -vvv before unbind
Created attachment 144880 [details] lspci -vvv after unbind
My first guess is that unbinding causes a GPU reset, which is known to leave the GPU in a messy state ("the reset bug").
Created attachment 144896 [details]
unbinding without X running

I've attached a log of an unbind attempt without X running:

systemctl stop sddm
echo 0 > /sys/class/vtconsole/vtcon0/bind
echo 0 > /sys/class/vtconsole/vtcon1/bind
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind || true
echo "0000:03:00.0" > /sys/bus/pci/devices/0000:03:00.0/driver/unbind

The result is the same, but the backtrace looks a bit different. This was done with kernel 5.2.1.

I've tried suspend to RAM and another reset-bug mitigation (which helps in other cases), but the GPU is still unusable after this failed unbind attempt. I still can't rebind it to amdgpu or vfio-pci, and a clean shutdown is not possible.
This seems to be a regression. I can unbind from amdgpu and bind to vfio-pci just fine on kernel 4.19.60-1-lts. I was able to unbind without the previous error after:

echo 0 > /sys/class/vtconsole/vtcon0/bind
echo 0 > /sys/class/vtconsole/vtcon1/bind
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind || true
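For anyone trying to reproduce this, the full working sequence on 4.19 would roughly look like the sketch below (untested as written here; sddm and the PCI address are simply the ones used in this report):

# stop the display manager so nothing holds the GPU's DRM nodes
systemctl stop sddm

# detach the framebuffer console from the virtual terminals
echo 0 > /sys/class/vtconsole/vtcon0/bind
echo 0 > /sys/class/vtconsole/vtcon1/bind

# release the EFI framebuffer, if it is still bound
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind || true

# now unbind the Vega from amdgpu
echo 0000:03:00.0 > /sys/bus/pci/drivers/amdgpu/unbind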
Created attachment 144907 [details]
kernel 5.1

I've narrowed it down to kernel 5.1. There are a lot of amdgpu changes in 5.1 (Vega-related changes specifically). I hope someone more knowledgeable about amdgpu will be able to find what exactly in 5.1 breaks unbinding. Let me know if I can help.
Created attachment 144952 [details]
another kernel, another disastrous unbind attempt

I couldn't rebind my RX 470 or shut down the system cleanly after unbinding it on any kernel my NixOS has had since I got the card last winter. I reproduced the OP's method on 4.19.64, got severe warnings and an oops, and "modprobe -r amdgpu" just hangs.
I'll do more testing, but it seems that the unbind works with kernel 5.3-rc7. There is still this error in the log:

[drm:amdgpu_pci_remove [amdgpu]] *ERROR* Device removal is currently not supported outside of fbcon

but there are no backtraces, and the unbind seems to succeed both with and without X running (on the other GPU, of course). It'd be nice to have confirmation from other people.

Note that to bind the GPU to vfio-pci, the reset app must be used after unbinding from amdgpu: https://forum.level1techs.com/t/vega-10-and-12-reset-application/145666
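For completeness, once the reset app has been run, the rebind to vfio-pci can be done with driver_override, roughly like this (a sketch; it assumes the vfio-pci module is available and uses the same PCI address as above):

modprobe vfio-pci

# tell the PCI core that vfio-pci should claim this device
echo vfio-pci > /sys/bus/pci/devices/0000:03:00.0/driver_override

# trigger the bind
echo 0000:03:00.0 > /sys/bus/pci/drivers/vfio-pci/bind

# clear the override again so amdgpu can claim the device later
echo > /sys/bus/pci/devices/0000:03:00.0/driver_override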
I confirm that on 5.3-rc7 I could unbind/bind the RX 470 multiple times and shut the system down cleanly afterwards. I got a warning with a trace in dmesg; I'm now going to check whether this affects system stability and whether my goal of switching the Radeon-powered seat between a Linux desktop (without a persistent session, of course) and a virtual machine is now reachable.
Since my last comment I've used this a dozen times to switch between the Linux desktop and a Windows VM. One time amdgpu crashed after resume from suspend, but I'm not sure it was related to this bug, and I was still able to reboot after it. However, I still sometimes get this warning on unbind:

WARNING: CPU: 0 PID: 1109 at drivers/gpu/drm/amd/amdgpu/amdgpu_object.c:929 amdgpu_bo_unpin+0xc8/0xf0 [amdgpu]
Modules linked in: vfio_pci vfio_virqfd vfio_iommu_type1 vfio fuse amdgpu amd_iommu_v2 gpu_sched ttm xt_CHECKSUM xt_MASQUERADE ipt_REJECT nf_rejec>
 nf_conntrack nf_defrag_ipv4 libcrc32c zsmalloc ip6t_rpfilter ipt_rpfilter ip6table_raw iptable_raw xt_pkttype nf_log_ipv6 nf_log_ipv4 nf_log_comm>
CPU: 0 PID: 1109 Comm: .libvirtd-wrapp Tainted: G O 5.3.0-rc7 #1-NixOS
Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./H61M-DGS R2.0, BIOS P1.10 10/01/2013
RIP: 0010:amdgpu_bo_unpin+0xc8/0xf0 [amdgpu]
Code: ff 48 83 c0 0c 48 39 d0 75 ea 48 8d 73 30 48 8d 7b 50 48 8d 54 24 08 e8 46 1f d8 ff 85 c0 74 a1 e9 30 6c 21 00 e8 28 f9 6b f5 <0f> 0b 48 8b >
RSP: 0018:ffffa4df00a4bd28 EFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff8c60449a4800 RCX: 0000000000000002
RDX: ffff8c60423c9b00 RSI: 0000000000000000 RDI: ffff8c60449a4800
RBP: ffff8c6008fa4058 R08: 0000000000000000 R09: ffffffffc0b3c000
R10: ffff8c60449a2800 R11: 0000000000000001 R12: ffff8c6008fa6378
R13: ffff8c6008fa6370 R14: ffff8c6008fa4058 R15: ffff8c6008d7f260
FS: 00007fac9a81f700(0000) GS:ffff8c605f400000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffea51ccff8 CR3: 00000004048c4003 CR4: 00000000001606f0
Call Trace:
 amdgpu_bo_free_kernel+0x6b/0x120 [amdgpu]
 amdgpu_gfx_rlc_fini+0x47/0x70 [amdgpu]
 gfx_v8_0_sw_fini+0xa1/0x1a0 [amdgpu]
 amdgpu_device_fini+0x257/0x479 [amdgpu]
 amdgpu_driver_unload_kms+0x4a/0x90 [amdgpu]
 drm_dev_unregister+0x4b/0xb0 [drm]
 amdgpu_pci_remove+0x25/0x50 [amdgpu]
 pci_device_remove+0x3b/0xc0
 device_release_driver_internal+0xd8/0x1b0
 unbind_store+0x94/0x120
 kernfs_fop_write+0x108/0x190
 vfs_write+0xa5/0x1a0
 ksys_write+0x59/0xd0
 do_syscall_64+0x4e/0x120
 entry_SYSCALL_64_after_hwframe+0x44/0xa9
RIP: 0033:0x7faca4a7b36f
Code: 1f 40 00 41 54 55 49 89 d4 53 48 89 f5 89 fb 48 83 ec 10 e8 53 fd ff ff 4c 89 e2 41 89 c0 48 89 ee 89 df b8 01 00 00 00 0f 05 <48> 3d 00 f0 >
RSP: 002b:00007fac9a81e4d0 EFLAGS: 00000293 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000012 RCX: 00007faca4a7b36f
RDX: 000000000000000c RSI: 00007fac84019a20 RDI: 0000000000000012
RBP: 00007fac84019a20 R08: 0000000000000000 R09: 000000000000002f
R10: 0000000000000000 R11: 0000000000000293 R12: 000000000000000c
R13: 0000000000000000 R14: 0000000000000012 R15: 00007fac9a81e568
---[ end trace ffd153eee3d00ec4 ]---
amdgpu 0000:01:00.0: 00000000001146cc unpin not necessary

It's produced by https://github.com/torvalds/linux/blob/574cc4539762561d96b456dbc0544d8898bd4c6e/drivers/gpu/drm/amd/amdgpu/amdgpu_object.c#L937 ; I wonder if the buffer object pin count is something like a reference count.

It also looks like the message

*ERROR* Device removal is currently not supported outside of fbcon

is printed unconditionally, without checking whether the DRM nodes are being used by userspace clients. I wonder if it's possible to implement such a check and prevent the unbind if they are.
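Regarding that last point: from userspace one can at least approximate such a check before writing to the unbind file, for example (a sketch; the by-path names assume the GPU at 0000:01:00.0 as in the trace above, adjust to your device):

# list any processes that still have this GPU's DRM nodes open
fuser -v /dev/dri/by-path/pci-0000:01:00.0-card /dev/dri/by-path/pci-0000:01:00.0-render

# alternatively, without the by-path symlinks
lsof /dev/dri/card* /dev/dri/renderD*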
Fedora 31, 5.3.1 kernel, 5700 XT - still seeing problems with unbinding from the amdgpu driver. I have video=efifb:off in my kernel parameters to keep efifb from ever using the card. After stopping X and unbinding vtcon0 and vtcon1, attempting to unbind the driver from the card yields the following error; I cannot bind a new driver to the card, and I can't shut down cleanly.

[ 140.760872] fbcon: Taking over console
[ 140.773454] Console: switching to colour frame buffer device 320x90
[ 577.562635] Console: switching to colour dummy device 80x25
[ 679.403956] VFIO - User Level meta-driver version: 0.3
[ 679.410718] [drm:amdgpu_pci_remove [amdgpu]] *ERROR* Device removal is currently not supported outside of fbcon
[ 679.410938] [drm] amdgpu: finishing device.
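Before attempting the unbind it may also be worth checking what is actually sitting on the console, e.g. (a sketch):

# registered framebuffer devices (efifb should be absent with video=efifb:off)
cat /proc/fb

# which driver is currently backing each virtual console
for v in /sys/class/vtconsole/vtcon*; do echo "$v: $(cat "$v"/name)"; done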
My comment above should have referenced 5.3.7 as the kernel version.
(In reply to Andrew B from comment #13)
> My comment above should have reference 5.3.7 as the kernel version.

For Navi you can try this kernel patch: https://forum.level1techs.com/t/navi-reset-kernel-patch/147547
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/878.