Bug 101946 - Rebinding AMDGPU causes initialization errors [R9 290]
Summary: Rebinding AMDGPU causes initialization errors [R9 290]
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/AMDgpu
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Default DRI bug account
QA Contact:
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-07-27 11:46 UTC by Robin
Modified: 2019-11-19 08:20 UTC
CC List: 2 users

See Also:
i915 platform:
i915 features:


Attachments
The script used to reproduce the error. (1.36 KB, application/x-shellscript)
2017-07-27 11:46 UTC, Robin
no flags Details
Script output (1.80 KB, text/plain)
2017-07-27 11:47 UTC, Robin
no flags Details
kern.log (133.30 KB, text/plain)
2017-07-27 11:47 UTC, Robin
no flags Details
possible fix 1/2 (2.02 KB, patch)
2017-07-27 14:42 UTC, Alex Deucher
no flags Details | Splinter Review
possible fix 2/2 (4.15 KB, patch)
2017-07-27 14:42 UTC, Alex Deucher
no flags Details | Splinter Review
kern.log for drm-next-4.14-wip (133.32 KB, text/plain)
2017-07-28 11:46 UTC, Robin
no flags Details
possible fix 3/2 (987 bytes, patch)
2017-07-28 14:24 UTC, Alex Deucher
no flags Details | Splinter Review
kern.log for drm-next-4.14-wip with patch 3 (143.91 KB, text/plain)
2017-07-28 15:19 UTC, Robin
no flags Details
4.13rc2 ubuntu kern.log (142.64 KB, text/x-log)
2017-07-28 15:37 UTC, Robin
no flags Details
case2-rescan-amd.sh (1.09 KB, application/x-shellscript)
2017-07-28 17:26 UTC, Robin
no flags Details
case3.sh (1.67 KB, application/x-shellscript)
2017-07-28 20:55 UTC, Robin
no flags Details
Logging shutdown function (801 bytes, patch)
2017-07-29 16:05 UTC, Robin
no flags Details | Splinter Review
Brute-force fix, resets sdma every init (1.18 KB, patch)
2017-07-29 22:22 UTC, Robin
no flags Details | Splinter Review
Test of above with R9 380 with Windows 8.1 and latest AMD drivers (6.03 KB, application/x-bzip2)
2017-08-01 14:28 UTC, Luke A. Guest
no flags Details
Script to rebind a device back to the vfio-pci driver (434 bytes, application/x-shellscript)
2017-08-01 15:16 UTC, Luke A. Guest
no flags Details

Description Robin 2017-07-27 11:46:28 UTC
Created attachment 133068 [details]
The script used to reproduce the error.

As I attempted to hotplug my R9 290 for a VM gaming setup, I stumbled on this issue.

The main kern.log error to come up is:

> [  160.013733] [drm:ci_dpm_enable [amdgpu]] *ERROR* ci_start_dpm failed
> [  160.014134] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <amdgpu_powerplay> failed -22
> [  160.014531] amdgpu 0000:01:00.0: amdgpu_init failed


For my setup I use a Kaby Lake iGPU running i915.
With the R9 290 using vfio-pci / amdgpu.
Ubuntu 17.04 (4.10.0-28-generic).
Mesa 17.1.4 from the padoka stable PPA.


I'm able to reproduce this as follows.

1. Boot with vfio-pci capturing the card and amdgpu blacklisted. Kernel flags:
> intel_iommu=on iommu=pt vfio-pci.ids=1002:67b1,1002:aac8

2. Since I run Gnome3 on Ubuntu 17.04, this will bring me to a wayland greeter which uses my iGPU. Drop to a free TTY, without logging in. This prevents Xorg from responding to the AMD card becoming available.

3. Run the attached script "rebind-amd.sh" as root to bind back and forth between vfio-pci and amdgpu in an infinite loop.

This will:

A. modprobe both drivers to be sure they're loaded.
B. Print information about the driver and card usage.
C. Use the new_id > unbind > bind > remove_id sequence to switch drivers.
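
For reference, one vfio-pci -> amdgpu iteration of that sequence boils down to roughly the following minimal sketch (0000:01:00.0 and 1002:67b1 are the VGA function and IDs from my setup; the attached script also handles the audio function and loops in both directions):

echo "1002 67b1" > /sys/bus/pci/drivers/amdgpu/new_id      # add the ID to the target driver's dynamic table
echo 0000:01:00.0 > /sys/bus/pci/drivers/vfio-pci/unbind   # detach from the old driver
echo 0000:01:00.0 > /sys/bus/pci/drivers/amdgpu/bind       # attach to the new driver
echo "1002 67b1" > /sys/bus/pci/drivers/amdgpu/remove_id   # drop the dynamic ID again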

What happens is:

vfio-pci -> vfio-pci: gives no problems, of course.
vfio-pci -> amdgpu: this works and the amdgpu driver initializes the card. Attached monitor(s) start searching for signals.
amdgpu -> vfio-pci: since no Xorg is using the dGPU, this works without problems.
vfio-pci -> amdgpu: fails to initialize the dGPU with the kernel error above.


I've attached the script, the output of the script and the full kern.log.
Comment 1 Robin 2017-07-27 11:47:11 UTC
Created attachment 133069 [details]
Script output
Comment 2 Robin 2017-07-27 11:47:33 UTC
Created attachment 133070 [details]
kern.log
Comment 3 Robin 2017-07-27 12:00:20 UTC
What I noticed from the kern.log is that it seems to skip init steps the second time amdgpu loads. So perhaps the unbind doesn't do a clean enough shutdown, or there may be a bug in the logic that skips init steps.

For example, the first time:
> [  129.439652] amdgpu 0000:01:00.0: enabling device (0000 -> 0003)
> ...
> [  129.918128] [drm] GPU posting now...

The second time:
No mention of enabling device.
> [  159.722828] [drm] GPU post is not needed
Comment 4 Alex Deucher 2017-07-27 14:42:33 UTC
Created attachment 133074 [details] [review]
possible fix 1/2

Do the attached patches help (based on my drm-next-4.14-wip branch)?
Comment 5 Alex Deucher 2017-07-27 14:42:50 UTC
Created attachment 133075 [details] [review]
possible fix 2/2
Comment 6 Robin 2017-07-28 10:26:36 UTC
Thanks for the quick patches! I'm working my way toward building your kernel branch, to rule out other changes having fixed the issue. It may take a little while, as I've not had to build my own kernels before.

Anyway, going from 4.10 to the 4.13rc2 kernel found here
http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13-rc2/
shows the same problem, although slightly less reliably, and now including ring test errors.

By "less reliably" I mean that the driver *sometimes* survives a 2nd binding, but gives the same error on the 3rd.

More results pending.
Comment 7 Robin 2017-07-28 11:46:03 UTC
Created attachment 133098 [details]
kern.log for drm-next-4.14-wip

Building the drm-next-4.14-wip branch including both patches does not resolve the issue; it behaves similarly to the previous 4.13rc2 kernel regarding the ring test errors that show up, typically on rings 1, 9 and/or 10.
Comment 8 Alex Deucher 2017-07-28 14:24:25 UTC
Created attachment 133099 [details] [review]
possible fix 3/2

Does using this patch on top of the other two help?
Comment 9 Robin 2017-07-28 15:19:14 UTC
Created attachment 133100 [details]
kern.log for drm-next-4.14-wip with patch 3

Same issue with patch 3.

I've attached the kern.log of one of the occasions where it gave the init error on the 3rd time binding amdgpu, rather than the 2nd.
Comment 10 Robin 2017-07-28 15:37:01 UTC
Created attachment 133101 [details]
4.13rc2 ubuntu kern.log

Inspecting the output more closely there's a subtle difference in the error produced.

While the 4.10 kernel produces:

> [  160.013733] [drm:ci_dpm_enable [amdgpu]] *ERROR* ci_start_dpm failed
> [  160.014134] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <amdgpu_powerplay> failed -22

The 4.13rc2, drm-next-4.14-wip and drm-next-4.14-wip with patch 3 produce:

> [  134.226312] [drm:cik_sdma_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 9 test failed (0xCAFEDEAD)
> [  134.226822] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <cik_sdma> failed -22


And something I noticed for the cases where the error occurs on the third bind: the 2nd and 3rd binds have much longer ring 1 test times than the first bind.

> [   69.938959] [drm] ring test on 1 succeeded in 2 usecs
> ...
> [  102.040253] [drm] ring test on 1 succeeded in 677 usecs
> ...
> [  134.121468] [drm] ring test on 1 succeeded in 677 usecs
Comment 11 Alex Deucher 2017-07-28 15:53:24 UTC
Note that the GPU reset in patch 3/2 requires access to pci config registers for the GPU which many hypervisors block, so you'd need to make sure that works for the reset to work.
Comment 12 Robin 2017-07-28 17:26:43 UTC
Created attachment 133103 [details]
case2-rescan-amd.sh

In an attempt to make a second test case I've created a new script that produced some noteworthy results.

Rather than bind/unbind, this approach uses rmmod, modprobe, removing the PCI device and rescanning to switch drivers.

Please excuse how poorly written and contrived this "hotswapping" test case is; I'll try isolating what causes the differences from the first test case with some variations next, but I wanted to share the intermediate results as-is first.

Some details about this test.

The starting point is the same as in the other test case: a TTY, with vfio-pci taking the card first. In order the script will:

1. rmmod the current driver.
2. remove one pci subdevice (either VGA or Audio)
3. modprobe the new driver.
4. perform a pci rescan.

It will do this in a loop switching between amdgpu and vfio-pci again.

Another difference is that snd_hda_intel is in use elsewhere; it does not get an rmmod and therefore will not switch back to vfio-pci.
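
One iteration of that loop boils down to roughly this (a rough sketch only; 0000:01:00.0 is the R9 290's VGA function from the logs above, and the actual script also covers the audio function and the reverse direction):

rmmod amdgpu                                       # 1. unload the current driver
echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove  # 2. remove one PCI subdevice
modprobe vfio-pci                                  # 3. load the driver to switch to
echo 1 > /sys/bus/pci/rescan                       # 4. rescan so the device reappears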

---

As for results, on 4.10 there was no change: from the 2nd binding onward the driver fails to init with this error.
> [  160.013733] [drm:ci_dpm_enable [amdgpu]] *ERROR* ci_start_dpm failed
> [  160.014134] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <amdgpu_powerplay> failed -22

For 4.13rc2, drm-next-4.14-wip and drm-next-4.14-wip with patch 3 it's a different story.

They show an irregular pattern of errors each loop.
The first error crops up on either the 2nd or 3rd bind; typically it is:
> [  211.818341] [drm:cik_sdma_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 9 test failed (0xCAFEDEAD)
> [  211.818725] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <cik_sdma> failed -22

After that first error, the following error can appear as well.
> [  247.626839] [drm:gfx_v7_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 1 test failed (scratch(0xC040)=0xCAFEDEAD)

And instead of ring 9, ring 10 may fail.
> [  356.686092] [drm:cik_sdma_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 10 test failed (0xCAFEDEAD)
> [  356.686580] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <cik_sdma> failed -22

They seem to randomly happen in the following combinations:

A. Ring 1 fails.
B. Ring 9 or 10 fails.
C. Ring 1 + Ring 9 or 10 fails.

Most importantly though: only if ring 9 or 10 fails (combination B or C) does the hw_init error occur. If it's just a ring 1 failure (A), the driver will successfully init the GPU.

Also, the drm-next-4.14-wip kernel with patch 3 hits this A combination, and thus a successful init, a lot more often than the other two.

---

So my suspicion is that this difference could be due to:
- Repeatedly rmmodding and modprobing being part of the loop now.
- The rescanning method vs bind/unbind.
- The different treatment of the Audio component.
- The different access of vfio-pci to the Audio component.

So I will make several variations on the test scripts to try and narrow this down.
Comment 13 Robin 2017-07-28 17:29:42 UTC
(In reply to Alex Deucher from comment #11)
> Note that the GPU reset in patch 3/2 requires access to pci config registers
> for the GPU which many hypervisors block, so you'd need to make sure that
> works for the reset to work.

I'm not actually utilizing vfio-pci in these test cases; this runs as root from a TTY on the host machine, so I would assume it works. I don't know how to verify this though; let me know how I could test it.
Comment 14 Robin 2017-07-28 20:55:59 UTC
Created attachment 133108 [details]
case3.sh

So, tinkering with the test script, I've only been able to eliminate some suspicions and invalidate my observation that the patch performed better.

I've taken out vfio-pci binding from the loop; it's only used during boot to keep the GPU free to unbind. So the issue is not related to vfio-pci having access in between binds.

I've made 3 methods for rebinding amdgpu.
amdgpu rmmod > modprobe
remove pci devices > rescan
driver unbind > bind

I've run each of these a few dozen times on each kernel and none of them really stand out. All of them have a chance to work (as in, ring 1 test failure only) or to fail.

4.13rc2, drm-next-4.14-wip and drm-next-4.14-wip + patch 3 all show this behaviour.
So no, I don't think the patches have helped after all.
Comment 15 Alex Deucher 2017-07-28 21:06:58 UTC
Are you using a patched qemu that attempts to do radeon device specific gpu reset?  If so, does removing that code help?  Next, are you sure pci config access is allowed in your configuration?  As I mentioned in comment 11, it's required for gpu reset to work.
Comment 16 Robin 2017-07-28 21:17:57 UTC
(In reply to Alex Deucher from comment #15)
> Are you using a patched qemu that attempts to do radeon device specific gpu
> reset?  If so, does removing that code help?  Next, are you sure pci config
> access is allowed in your configuration?  As I mentioned in comment 11, it's
> required for gpu reset to work.

I have the Ubuntu-supplied version installed.
> $ kvm --version
> QEMU emulator version 2.8.0(Debian 1:2.8+dfsg-3ubuntu2.3)
> Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers

But KVM/Qemu is not being invoked. After a fresh boot on bare metal, these are the results I get in a root TTY. 

I have seen mention of vfio-pci using device specific resets though.
https://www.spinics.net/lists/kvm/msg116277.html
So I will try to completely take it out of my test.

I'm not sure about pci config access, since I don't know how to verify this. Any instructions would be appreciated.
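
(As a basic sanity check, dumping the card's config space from the host with lspci should at least show whether plain config reads work; reads coming back as all 0xff would suggest config access is broken. This of course doesn't prove the reset sequence itself will work.)

lspci -s 01:00.0 -xxx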
Comment 17 Robin 2017-07-28 21:55:45 UTC
I've tested:

- Disabling vfio-pci, no changes
- Disabling iommu support, no changes
- Booting with and without amdgpu blacklisted, no changes
Comment 18 Robin 2017-07-29 16:05:40 UTC
Created attachment 133127 [details] [review]
Logging shutdown function

I've modified the patch to include info messages. The code path is never executed in my tests.
Comment 19 Robin 2017-07-29 17:20:44 UTC
I've found that my test cases only trigger the PCI driver's
amdgpu_pci_remove and amdgpu_pci_probe functions.

Adding the new shutdown function call amdgpu_device_shutdown(adev); to the amdgpu_pci_remove function does not resolve the issue.
Comment 20 Robin 2017-07-29 22:22:17 UTC
Created attachment 133132 [details] [review]
Brute-force fix, resets sdma every init

After much trial and error, I've found this approach to work: on every hw_init, both SDMAs are flagged for a soft reset.

I have tried the existing soft reset code as well, but the busy status flags used to selectively reset the SDMAs do not reliably prevent the errors in my tests.

With this patch the ring 9 and 10 test errors no longer appear, and the ring 1 errors are prevented as well.
Comment 21 Luke A. Guest 2017-08-01 14:28:05 UTC
Created attachment 133172 [details]
Test of above with R9 380 with Windows 8.1 and latest AMD drivers

Hi,

After being asked by Alex on IRC to try this, I've added the output of the various logs; there will be overlap in places between dmesg and messages.

I'm running 4.13.0-rc2 with the drm-next-4.14 branch merged and the set of 3 patches from Alex. I've tried with and without the third patch; I still get a black screen on restarting the VM (using virt-manager). The first boot from a freshly booted host starts fine.

In the log I've put "START" and "RESTART" where the VM is started (then shutdown) and then restarted. There is also extra PCI debugging messages enabled in the kernel.

I, too, would like an answer regarding the probing mentioned in comment 11.
Comment 22 Luke A. Guest 2017-08-01 14:59:21 UTC
I'd like to report a minor success: I've managed to boot into win8.1 twice in a row. I booted as normal through virt-manager, then shut down from inside the guest, then called a script:

#!/bin/sh

echo 1 > /sys/bus/pci/devices/0000\:03\:00.0/remove
echo 1 > /sys/bus/pci/devices/0000\:03\:00.1/remove
echo 1 > /sys/bus/pci/rescan

/opt/vfio/rebind_dev_to_vfio.sh 0000:03:00.0
/opt/vfio/rebind_dev_to_vfio.sh 0000:03:00.1

Then I restarted the guest from virt-manager; it booted fine, and again I shut down from within the guest.

On running the above script a second time, the machine hung hard. I couldn't log in through serial, and the SysRq keys didn't do anything.
Comment 23 Luke A. Guest 2017-08-01 15:16:34 UTC
Created attachment 133173 [details]
Script to rebind a device back to the vfio-pci driver

forgot to submit this.
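
The attachment isn't inlined here; a rebind helper like this usually amounts to something along the lines of the following hypothetical sketch using driver_override (the actual rebind_dev_to_vfio.sh may well differ):

#!/bin/sh
# Hypothetical sketch only -- usage: rebind_dev_to_vfio.sh 0000:03:00.0
DEV="$1"

# Unbind from whatever driver currently owns the device, if any.
if [ -e /sys/bus/pci/devices/$DEV/driver ]; then
    echo "$DEV" > /sys/bus/pci/devices/$DEV/driver/unbind
fi

# Make the PCI core pick vfio-pci on the next probe, then trigger the probe.
echo vfio-pci > /sys/bus/pci/devices/$DEV/driver_override
echo "$DEV" > /sys/bus/pci/drivers_probe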
Comment 24 Robin 2017-08-01 16:16:55 UTC
(In reply to Luke A. Guest from comment #22)
> I'd like to report a minor success. I've managed to boot into win8.1 twice
> in a row. I booted as normal through virt-manager, then shutdown from inside
> the guest, then called a script:

Hi Luke, a few questions. When booting the host, do you boot with amdgpu or vfio-pci bound to the GPU? After you've started a VM, did you bind back to amdgpu or did you stay on vfio-pci?

Is it during vfio or amdgpu control that your system hangs on the second boot?

If it's during amdgpu, have you tried my patch from comment 20?
Comment 25 Luke A. Guest 2017-08-01 16:30:55 UTC
> Hi Luke, few questions. When booting the host, do you boot with amdgpu or vfio-pci bound to the GPU? After you've started a VM, did you bind back to amdgpu or did you stay on vfio-pci?

I have 2 AMD GPUs, an R9 390 (host) and an R9 380 (guest). I boot with the 380 being passed over to vfio-pci. On exit the VM sets the 380 back to vfio-pci.

> Is it during vfio or amdgpu control that your system hangs on the second boot?

It was during a boot of the VM, the devices were attached to the vfio-pci driver.

> If it's during amdgpu, have you tried my patch from comment 20?

I haven't tried it; I don't think it would apply to my card as it's VI, not CIK. Although, if I were using the 390 (CIK), it likely would. The issues are similar though, and I believe I've just proved that the so-called hw reset bug, in my case anyway, is sw not hw.
Comment 26 Robin 2017-08-01 16:58:07 UTC
(In reply to Luke A. Guest from comment #25)
> I have 2 AMD GPU's, R9 390 (host) and R9 380 (guest). I boot with the 380
> being passed over to vfio-pci. On exit the VM sets the 380 back to vfio-pci.

FWIW I don't think any of these patches are relevant to you then.
The reset logic for your 380 would be coming from the guest's driver + vfio-pci, where vfio in theory should only try to get the 380's state back to how it would be after an actual reboot, leaving the more sophisticated work to the guest driver.

Though as mentioned here https://www.spinics.net/lists/kvm/msg116277.html vfio-pci may employ hardware specific solutions if there's no good blanket solution.

---

For my scenario I have the Intel iGPU and the R9 290. So I am trying to find a setup where I can use the 290 for gaming on both the host and the guest. Once the 290 is bound to vfio-pci I have no issues with the VM. Reboot, force off, as many times as I like and no problems.

It's when I am done with the VMs and try to give the 290 back to amdgpu that I hit init issues, which my comment 20 patch does resolve, even if it is a carpet-bomb approach to solving it.
Comment 27 Luke A. Guest 2017-08-01 17:47:32 UTC
> FWIW I don't think any of these patches are relevant to you then.

Not strictly true. As I said, Alex pointed me at this page to try his patches. I believe all this is connected. There are issues un/binding from/to the driver. There are reset issues as well. 

I've put my test branch here https://github.com/Lucretia/linux-amdgpu/tree/amdgpu/v4.13-rc2-amdgpu-reset-test
Comment 28 Robin 2017-08-02 06:37:08 UTC
(In reply to Luke A. Guest from comment #27)
> > FWIW I don't think any of these patches are relevant to you then.
> 
> Not strictly true. As I said, Alex pointed me at this page to try his
> patches. I believe all this is connected. There are issues un/binding
> from/to the driver. There are reset issues as well. 

True; there may be init/cleanup issues with AMD cards that would be better understood and documented in the open source community if this were fixed in amdgpu, and hopefully that would help you in vfio-pci as well.
Comment 29 Robin 2017-08-30 20:55:45 UTC
As I will pass on my R9 290 and switch to an RX 580, please let me know if you need any extra information from either card before I no longer have the R9 290.
Comment 30 Martin Peres 2019-11-19 08:20:38 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/207.

