Bug 111229 - Unable to unbind GPU from amdgpu
Summary: Unable to unbind GPU from amdgpu
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
QA Contact:
Depends on:
Reported: 2019-07-27 04:14 UTC by wedens13
Modified: 2019-10-05 22:13 UTC

dmesg kernel 5.2.1 (218.15 KB, text/plain)
2019-07-27 04:14 UTC, wedens13
dmesg kernel 4.19.60 (225.00 KB, text/plain)
2019-07-27 04:14 UTC, wedens13
lspci -vvv before unbind (7.05 KB, text/plain)
2019-07-27 04:14 UTC, wedens13
lspci -vvv after unbind (7.02 KB, text/plain)
2019-07-27 04:15 UTC, wedens13
unbinding without X running (169.43 KB, text/plain)
2019-07-28 11:35 UTC, wedens13
kernel 5.1 (157.84 KB, text/plain)
2019-07-29 13:09 UTC, wedens13
another kernel, another disasterous unbind attempt (3.22 KB, text/plain)
2019-08-06 00:14 UTC, Eugene Shatsky

Description wedens13 2019-07-27 04:14:03 UTC
Created attachment 144877 [details]
dmesg kernel 5.2.1

Arch Linux
Kernel version: 5.2.1

I have two GPUs in my system: an integrated Intel GPU and a Sapphire Pulse Vega 56.
I boot with the Intel GPU as my primary GPU and use the Vega for VFIO (GPU passthrough) and GPU offloading.
What I'm trying to do is boot with the amdgpu driver bound to the Vega and rebind it to vfio-pci when I start a VM (qemu).

The problem occurs when I try to unbind Vega from amdgpu driver using this command:
echo -n "0000:03:00.0" > /sys/bus/pci/drivers/amdgpu/unbind

This results in a segfault, with the following error in dmesg (the full dmesg from boot to shutdown is attached):
[drm:amdgpu_pci_remove [amdgpu]] *ERROR* Device removal is currently not supported outside of fbcon

After that, I'm unable to rebind the device to amdgpu or any other driver:
echo "0000:03:00.0" > /sys/bus/pci/drivers/amdgpu/bind
bash: echo: write error: No such device
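
Before writing to bind or unbind, it can help to check which driver (if any) currently holds the device. The following is only a minimal sketch: `current_driver` is a hypothetical helper name, and `SYSFS_ROOT` is an assumed override (not a kernel feature) that lets the function be tried against a fake directory tree instead of the real /sys.

```shell
# Hypothetical helper: print the name of the driver bound to a PCI device,
# or "none" if no driver is bound.
# SYSFS_ROOT is an assumed override for dry runs; it defaults to /sys.
current_driver() {
    dev="$1"
    root="${SYSFS_ROOT:-/sys}"
    link="$root/bus/pci/devices/$dev/driver"
    if [ -L "$link" ]; then
        # The "driver" entry is a symlink to the driver's sysfs directory.
        basename "$(readlink "$link")"
    else
        echo "none"
    fi
}
```

For example, `current_driver 0000:03:00.0` would be expected to print the bound driver's name while the device is still attached, and `none` once nothing holds it.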

I'm also unable to shut down properly. The shutdown process gets stuck at some point and only holding the power button helps.

I've attached the relevant lspci -vvv output from before and after the unbind attempt, in case it's useful.

I also tried unbinding on kernel 4.19.60; there the command just hangs. I've attached the log of this attempt (the error is different from 5.2.1).
Comment 1 wedens13 2019-07-27 04:14:33 UTC
Created attachment 144878 [details]
dmesg kernel 4.19.60
Comment 2 wedens13 2019-07-27 04:14:55 UTC
Created attachment 144879 [details]
lspci -vvv before unbind
Comment 3 wedens13 2019-07-27 04:15:13 UTC
Created attachment 144880 [details]
lspci -vvv after unbind
Comment 4 wedens13 2019-07-27 05:49:34 UTC
My first guess is that unbinding causes a GPU reset, which is known to leave the GPU in a messy state ("the reset bug").
Comment 5 wedens13 2019-07-28 11:35:35 UTC
Created attachment 144896 [details]
unbinding without X running

I've attached a log of an attempt to unbind without X running:

systemctl stop sddm
echo 0 > /sys/class/vtconsole/vtcon0/bind
echo 0 > /sys/class/vtconsole/vtcon1/bind
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind || true

echo "0000:03:00.0" > /sys/bus/pci/devices/0000:03:00.0/driver/unbind

The result is the same, but the backtrace seems a bit different. This was done with kernel 5.2.1.
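
The console-detach step above hardcodes vtcon0 and vtcon1. A minimal sketch of the same step that loops over every vtconsole entry instead is shown here; `detach_vtconsoles` is a hypothetical name and `SYSFS_ROOT` is an assumed override for trying it against a fake tree. It only generalizes the two echo commands and is not expected to change the outcome reported in this comment.

```shell
# Hypothetical helper: write 0 to the "bind" file of every vtconsole,
# the same thing the two hardcoded echo commands do for vtcon0 and vtcon1.
# SYSFS_ROOT is an assumed override for dry runs; it defaults to /sys.
detach_vtconsoles() {
    root="${SYSFS_ROOT:-/sys}"
    for f in "$root"/class/vtconsole/vtcon*/bind; do
        # Skip the unexpanded pattern when no vtconsole entries exist.
        [ -e "$f" ] || continue
        echo 0 > "$f"
    done
}
```

On a real system this needs root privileges, like the original commands.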

I've tried suspend to RAM and another reset bug mitigation (which helps in other cases), but the GPU is still unusable after this failed unbind attempt. I still can't rebind it to amdgpu or vfio-pci, and the system does not shut down cleanly.
Comment 6 wedens13 2019-07-28 18:38:35 UTC
Seems to be a regression. 

I can unbind from amdgpu and bind to vfio-pci just fine on kernel 4.19.60-1-lts.

I was able to unbind without the previous error after running:

echo 0 > /sys/class/vtconsole/vtcon0/bind
echo 0 > /sys/class/vtconsole/vtcon1/bind
echo efi-framebuffer.0 > /sys/bus/platform/drivers/efi-framebuffer/unbind || true
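
The thread does not show the exact commands used to move the device to vfio-pci. One common way to do the handoff through sysfs is the `driver_override` attribute plus `drivers_probe`; the sketch below assumes that approach, assumes the vfio-pci module is already loaded, and uses a hypothetical function name and an assumed `SYSFS_ROOT` override for dry runs. It does not work around the unbind failures described in this report.

```shell
# Hypothetical helper: hand a PCI device over to vfio-pci.
# Assumes vfio-pci is already loaded. SYSFS_ROOT is an assumed override
# for dry runs against a fake tree; it defaults to /sys.
rebind_to_vfio() {
    dev="$1"
    root="${SYSFS_ROOT:-/sys}"
    devdir="$root/bus/pci/devices/$dev"

    # Tell the PCI core which driver is allowed to claim the device next.
    echo "vfio-pci" > "$devdir/driver_override" || return 1

    # Release the device from whatever driver currently holds it.
    if [ -e "$devdir/driver/unbind" ]; then
        echo "$dev" > "$devdir/driver/unbind" || return 1
    fi

    # Ask the PCI core to probe the device again.
    echo "$dev" > "$root/bus/pci/drivers_probe"
}
```

Usage would be `rebind_to_vfio 0000:03:00.0`, run as root after the console-detach commands above.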
Comment 7 wedens13 2019-07-29 13:09:22 UTC
Created attachment 144907 [details]
kernel 5.1

I've narrowed the regression down to kernel 5.1. There are a lot of amdgpu changes in 5.1 (Vega-related changes specifically).

I hope someone more knowledgeable in amdgpu will be able to find what exactly in 5.1 breaks unbinding. Let me know if I can help.
Comment 8 Eugene Shatsky 2019-08-06 00:14:17 UTC
Created attachment 144952 [details]
another kernel, another disasterous unbind attempt

I haven't been able to rebind my RX 470 or shut the system down cleanly after unbinding it on any kernel my NixOS has had since I got the card last winter. I reproduced the OP's method on 4.19.64 and got severe warnings and an oops; "modprobe -r amdgpu" just hangs.
Comment 9 wedens13 2019-09-03 19:06:16 UTC
I'll do more testing, but it seems that unbind works with kernel 5.3-rc7.

There is still this error in the log:
[drm:amdgpu_pci_remove [amdgpu]] *ERROR* Device removal is currently not supported outside of fbcon
but without any backtraces, and unbind seems to succeed both with and without X running (on the other GPU, of course).
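
For anyone trying to confirm this, a rough way to tell the two outcomes apart is to check whether the kernel log contains a backtrace or only the removal error. This is a sketch under assumptions: `classify_unbind_log` is a hypothetical helper, and it relies on kernel backtraces containing the string "Call Trace:", so a trace from any unrelated subsystem in the same log would also be counted.

```shell
# Hypothetical helper: read kernel log text on stdin and print one of
#   backtrace   - a "Call Trace:" line is present
#   error-only  - only the amdgpu device-removal error is present
#   clean       - neither is present
classify_unbind_log() {
    log="$(cat)"
    if printf '%s\n' "$log" | grep -q 'Call Trace:'; then
        echo "backtrace"
    elif printf '%s\n' "$log" | grep -q 'Device removal is currently not supported outside of fbcon'; then
        echo "error-only"
    else
        echo "clean"
    fi
}
```

It can be fed from the live log with `dmesg | classify_unbind_log` right after the unbind attempt.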

It'd be nice to have confirmation from other people.

Note that to bind the GPU to vfio-pci, the reset application must be run after unbinding from amdgpu: https://forum.level1techs.com/t/vega-10-and-12-reset-application/145666
Comment 10 Eugene Shatsky 2019-10-05 22:13:02 UTC
I confirm that on 5.3-rc7 I could unbind/bind my RX 470 multiple times and shut the system down cleanly afterwards. I got a warning with a trace in dmesg; next I'm going to check whether this affects system stability and whether my goal of switching the Radeon-powered seat between a Linux desktop (without a persistent session, of course) and a virtual machine is now reachable.
