Summary: Rebinding AMDGPU causes initialization errors [R9 290]
Product: DRI
Component: DRM/AMDgpu
Status: RESOLVED MOVED
Severity: normal
Priority: medium
Version: unspecified
Hardware: x86-64 (AMD64)
OS: Linux (All)
Reporter: Robin <beanow>
Assignee: Default DRI bug account <dri-devel>
CC: beanow, laguest
See Also: https://bugs.freedesktop.org/show_bug.cgi?id=111229
Description
Robin
2017-07-27 11:46:28 UTC
Created attachment 133069 [details]
Script output
Created attachment 133070 [details]
kern.log
What I noticed from the kern.log is that it seems to skip init steps the second time amdgpu loads. So perhaps the unbind doesn't do a clean enough shutdown, or there may be a bug in the init-step skipping. For example, the first time:
> [ 129.439652] amdgpu 0000:01:00.0: enabling device (0000 -> 0003)
> ...
> [ 129.918128] [drm] GPU posting now...

The second time there is no mention of enabling the device:
> [ 159.722828] [drm] GPU post is not needed

Created attachment 133074 [details] [review]
possible fix 1/2

Do the attached patches help (based on my drm-next-4.14-wip branch)?

Created attachment 133075 [details] [review]
possible fix 2/2

Thanks for the quick patches! I'm working my way towards your kernel branch to rule out other changes fixing the issue. It may take a little while, as I've not had to build my own kernels before.

Anyway, going from 4.10 to the 4.13rc2 kernel found at http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.13-rc2/ shows the same problem, although slightly less reliably, and now including ring test errors. By "less reliably" I mean I've seen the driver *sometimes* survive a 2nd binding, but give the same error on the 3rd. More results pending.

Created attachment 133098 [details]
kern.log for drm-next-4.14-wip
Building the drm-next-4.14-wip branch including both patches does not resolve the issue, and it behaves similarly to the previous 4.13rc2 kernel regarding ring test errors showing up — typically rings 1, 9 and/or 10.
Created attachment 133099 [details] [review]
possible fix 3/2

Does using this patch on top of the other two help?

Created attachment 133100 [details]
kern.log for drm-next-4.14-wip with patch 3
Same issue with patch 3.
I've attached the kern.log of one of the occasions where it gave the init error on the 3rd time binding amdgpu, rather than the 2nd.
Created attachment 133101 [details]
4.13rc2 ubuntu kern.log

Inspecting the output more closely, there's a subtle difference in the error produced. While the 4.10 kernel produces:
> [ 160.013733] [drm:ci_dpm_enable [amdgpu]] *ERROR* ci_start_dpm failed
> [ 160.014134] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <amdgpu_powerplay> failed -22

4.13rc2, drm-next-4.14-wip and drm-next-4.14-wip with patch 3 produce:
> [ 134.226312] [drm:cik_sdma_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 9 test failed (0xCAFEDEAD)
> [ 134.226822] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <cik_sdma> failed -22

And something I noticed for the third-bind error cases: the 2nd and 3rd binds have much longer ring 1 tests than the first bind.
> [ 69.938959] [drm] ring test on 1 succeeded in 2 usecs
> ...
> [ 102.040253] [drm] ring test on 1 succeeded in 677 usecs
> ...
> [ 134.121468] [drm] ring test on 1 succeeded in 677 usecs

Note that the GPU reset in patch 3/2 requires access to pci config registers for the GPU, which many hypervisors block, so you'd need to make sure that works for the reset to work.

Created attachment 133103 [details]
case2-rescan-amd.sh

In an attempt to make a second test case, I've created a new script that produced some noteworthy results. Rather than bind/unbind, this approach uses rmmod, modprobe, and removing the pci device plus rescanning to switch drivers. Please excuse how poorly written and contrived the "hotswapping" test case is; I'll try isolating what causes the differences from the first test case in some mutations next, but wanted to share the intermediate results as-is first.

Some details about this test. The starting point is the same as the other test case: a TTY, with vfio-pci taking the card first. In order it will (see the sketch below):
1. rmmod the current driver.
2. Remove one pci subdevice (either VGA or Audio).
3. modprobe the new driver.
4. Perform a pci rescan.

It will do this in a loop, switching between amdgpu and vfio-pci again. Another difference: snd_hda_intel is in use elsewhere, so it does not get an rmmod and will not switch back to vfio-pci because of this.
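A minimal sketch of one iteration of that loop, assuming the addresses 0000:01:00.0 (VGA) and 0000:01:00.1 (Audio) from the logs above — the actual case2-rescan-amd.sh attachment is not shown in this migrated thread:

#!/bin/sh
# Hypothetical reconstruction of one amdgpu-bound iteration of the loop;
# the reverse iteration would swap the module names.
rmmod vfio-pci                                      # 1. rmmod the current driver
echo 1 > /sys/bus/pci/devices/0000:01:00.1/remove   # 2. remove one pci subdevice
modprobe amdgpu                                     # 3. modprobe the new driver
echo 1 > /sys/bus/pci/rescan                        # 4. pci rescan re-discovers it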
---

As for the results: on 4.10 there was no change. From the 2nd binding onward, this error will fail to init the driver:
> [ 160.013733] [drm:ci_dpm_enable [amdgpu]] *ERROR* ci_start_dpm failed
> [ 160.014134] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <amdgpu_powerplay> failed -22

For 4.13rc2, drm-next-4.14-wip and drm-next-4.14-wip with patch 3 it's a different story. They show an irregular pattern of errors every loop. Either the 2nd or 3rd time, the first error crops up. Typically this is:
> [ 211.818341] [drm:cik_sdma_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 9 test failed (0xCAFEDEAD)
> [ 211.818725] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <cik_sdma> failed -22

After that first error, the following error can additionally appear:
> [ 247.626839] [drm:gfx_v7_0_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 1 test failed (scratch(0xC040)=0xCAFEDEAD)

And instead of ring 9, ring 10 may fail:
> [ 356.686092] [drm:cik_sdma_ring_test_ring [amdgpu]] *ERROR* amdgpu: ring 10 test failed (0xCAFEDEAD)
> [ 356.686580] [drm:amdgpu_device_init [amdgpu]] *ERROR* hw_init of IP block <cik_sdma> failed -22

They seem to happen randomly in the following combinations:
A. Ring 1 fails.
B. Ring 9 or 10 fails.
C. Ring 1 plus ring 9 or 10 fails.

Most importantly, though: only if ring 9 or 10 fails (combination B or C) does the hw_init error occur. If it's just a ring 1 failure (A), the driver will successfully init the GPU. Also, the drm-next-4.14-wip with patch 3 kernel hits this A combination and a successful init a lot more often than the other two.

---

So my suspicion is that this difference could be due to:
- Repeatedly rmmodding and modprobing being part of the loop now.
- The rescanning method vs bind/unbind.
- The different treatment of the Audio component.
- The different access of vfio-pci to the Audio component.

So I will make several variations on the test scripts to try and narrow this down.

(In reply to Alex Deucher from comment #11)
> Note that the GPU reset in patch 3/2 requires access to pci config registers
> for the GPU which many hypervisors block, so you'd need to make sure that
> works for the reset to work.

I'm not actually utilizing vfio-pci in these test cases; this runs as root from a TTY on the host machine, so I would assume it works. I don't know how I would verify this though; let me know how I could test it.

Created attachment 133108 [details]
case3.sh
So, tinkering with the test script, I've only been able to eliminate some suspicions and invalidate my observation that the patch performed better.
I've taken vfio-pci binding out of the loop; it's only used during boot to keep the GPU free to unbind. So the issue is not related to vfio-pci having access in between binds.
I've made 3 methods for rebinding amdgpu (sketched below):
1. amdgpu rmmod > modprobe
2. remove pci devices > rescan
3. driver unbind > bind
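As one-liners against sysfs, assuming the GPU address 0000:01:00.0 from the logs above (a hedged sketch, not the attached case3.sh):

#!/bin/sh
# 1. rmmod > modprobe
rmmod amdgpu && modprobe amdgpu
# 2. remove pci devices > rescan
echo 1 > /sys/bus/pci/devices/0000:01:00.0/remove
echo 1 > /sys/bus/pci/rescan
# 3. driver unbind > bind
echo 0000:01:00.0 > /sys/bus/pci/drivers/amdgpu/unbind
echo 0000:01:00.0 > /sys/bus/pci/drivers/amdgpu/bind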
I've run each of these a few dozen times on each kernel and none of them really stands out. All of them have a chance to work (as in, only a ring 1 test failure) or to fail.
4.13rc2, drm-next-4.14-wip, drm-next-4.14-wip + patch3 all have this behaviour.
So no, I don't think the patches have helped after all.
Are you using a patched qemu that attempts to do a radeon device specific GPU reset? If so, does removing that code help? Next, are you sure pci config access is allowed in your configuration? As I mentioned in comment 11, it's required for GPU reset to work.

(In reply to Alex Deucher from comment #15)
> Are you using a patched qemu that attempts to do radeon device specific gpu
> reset? If so, does removing that code help? Next, are you sure pci config
> access is allowed in your configuration? As I mentioned in comment 11, it's
> required for gpu reset to work.

I have installed the Ubuntu-supplied version.
> $ kvm --version
> QEMU emulator version 2.8.0(Debian 1:2.8+dfsg-3ubuntu2.3)
> Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers

But KVM/QEMU is not being invoked. After a fresh boot on bare metal, these are the results I get in a root TTY. I have seen mention of vfio-pci using device specific resets though: https://www.spinics.net/lists/kvm/msg116277.html So I will try to take it out of my test completely.

I'm not sure about pci config access, since I don't know how to verify it. Any instructions would be appreciated.

I've tested:
- Disabling vfio-pci: no changes.
- Disabling iommu support: no changes.
- Booting with and without amdgpu blacklisted: no changes.

Created attachment 133127 [details] [review]
Logging shutdown function

I've modified the patch to include info messages. The code path is never executed in my tests. I've found that my test cases only trigger the PCI driver's amdgpu_pci_remove and amdgpu_pci_probe functions. Adding the new shutdown function call amdgpu_device_shutdown(adev); to the amdgpu_pci_remove function does not resolve the issue.

Created attachment 133132 [details] [review]
Brute-force fix, resets sdma every init

After much trial and error, I've found this approach to work: on every hw_init, both SDMA engines are flagged for a soft reset. I have tried the existing soft reset code as well, but the busy-status flags used to selectively reset the SDMAs do not work reliably enough in my tests to prevent the errors. Using this patch, the ring 9 and 10 test errors no longer appear, and it prevents the ring 1 errors as well.

Created attachment 133172 [details]
Test of above with R9 380 with Windows 8.1 and latest AMD drivers
Hi,
After being asked by Alex in IRC to try this, I've attached the output of the various logs (dmesg and messages); there will be overlap in places.
I'm running 4.13.0-rc2 with the drm-next-4.14 branch merged and the set of 3 patches from Alex. I tried with and without the third patch; I still get a black screen on restarting the VM (using virt-manager). The first boot from a freshly booted host starts fine.
In the log I've put "START" and "RESTART" where the VM is started (then shut down) and then restarted. There are also extra PCI debugging messages enabled in the kernel.
I, too, would like an answer regarding the pci config probing mentioned in comment 11.
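For reference, one way to sanity-check from a root shell whether the GPU's pci config space is readable might be the following (a hedged suggestion using standard pciutils, not from this thread; 03:00.0 is the guest GPU address used in the script below — adjust per system):

# Hex-dump the device's config space; a dump of all 0xFF bytes suggests
# config access is blocked.
lspci -s 03:00.0 -xxx
# Read the 16-bit vendor ID at offset 0x00; it should be 1002 for AMD.
setpci -s 03:00.0 0x00.w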
I'd like to report a minor success: I've managed to boot into win8.1 twice in a row. I booted as normal through virt-manager, then shut down from inside the guest, then called a script:

#!/bin/sh
echo 1 > /sys/bus/pci/devices/0000\:03\:00.0/remove
echo 1 > /sys/bus/pci/devices/0000\:03\:00.1/remove
echo 1 > /sys/bus/pci/rescan
/opt/vfio/rebind_dev_to_vfio.sh 0000:03:00.0
/opt/vfio/rebind_dev_to_vfio.sh 0000:03:00.1

Then I restarted the guest from virt-manager and it booted fine; again I shut down from within the guest. On running the above script a second time, the machine hung, hard. I couldn't log in through serial, and sysrq keys didn't do anything.

Created attachment 133173 [details]
Script to rebind a device back to the vfio-pci driver
Forgot to submit this.
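The attachment itself isn't inlined in this migrated thread; a typical rebind-to-vfio-pci helper of this shape, as a hypothetical sketch using the kernel's driver_override mechanism, might look like:

#!/bin/sh
# Hypothetical equivalent of rebind_dev_to_vfio.sh; takes a full PCI
# address such as 0000:03:00.0 as its first argument.
dev="$1"
# Tell the PCI core that vfio-pci should claim this device.
echo vfio-pci > "/sys/bus/pci/devices/$dev/driver_override"
# Unbind from the current driver, if any.
if [ -e "/sys/bus/pci/devices/$dev/driver" ]; then
    echo "$dev" > "/sys/bus/pci/devices/$dev/driver/unbind"
fi
# Ask the PCI core to re-probe the device, which now matches vfio-pci.
echo "$dev" > /sys/bus/pci/drivers_probe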
(In reply to Luke A. Guest from comment #22)
> I'd like to report a minor success: I've managed to boot into win8.1 twice
> in a row. I booted as normal through virt-manager, then shut down from
> inside the guest, then called a script:

Hi Luke, a few questions. When booting the host, do you boot with amdgpu or vfio-pci bound to the GPU? After you've started a VM, did you bind back to amdgpu or did you stay on vfio-pci? Is it during vfio or amdgpu control that your system hangs on the second boot? If it's during amdgpu, have you tried my patch from comment 20?

> Hi Luke, a few questions. When booting the host, do you boot with amdgpu or
> vfio-pci bound to the GPU? After you've started a VM, did you bind back to
> amdgpu or did you stay on vfio-pci?

I have two AMD GPUs: an R9 390 (host) and an R9 380 (guest). I boot with the 380 being passed over to vfio-pci. On exit the VM sets the 380 back to vfio-pci.

> Is it during vfio or amdgpu control that your system hangs on the second
> boot?

It was during a boot of the VM; the devices were attached to the vfio-pci driver.

> If it's during amdgpu, have you tried my patch from comment 20?

I haven't tried it; I don't think it would apply to my card, as it's VI not CIK. Although, if I were using the 390 (CIK), it likely would. The issues are similar though, and I believe I've just proved that the so-called hw reset bug, in my case anyway, is sw, not hw.

(In reply to Luke A. Guest from comment #25)
> I have two AMD GPUs: an R9 390 (host) and an R9 380 (guest). I boot with the
> 380 being passed over to vfio-pci. On exit the VM sets the 380 back to
> vfio-pci.

FWIW I don't think any of these patches are relevant to you then. The reset logic for your 380 would be coming from the guest's driver plus vfio-pci, where vfio in theory should only try to get the 380's state back to how it would be after an actual reboot, and leave the more sophisticated work to the guest driver. Though, as mentioned at https://www.spinics.net/lists/kvm/msg116277.html, vfio-pci may employ hardware-specific solutions if there's no good blanket solution.

---

For my scenario I have the Intel iGPU and the R9 290, and I am trying to find a setup where I can use the 290 for gaming on both the host and the guest. Once the 290 is bound to vfio-pci I have no issues with the VM: reboot, force off, as many times as I like, and no problems. It's when I am done with the VMs and try to give the 290 back to amdgpu that I had init issues, which my comment 20 patch does resolve, even if it is a carpet-bomb approach to solving it.

> FWIW I don't think any of these patches are relevant to you then.

Not strictly true. As I said, Alex pointed me at this page to try his patches. I believe all of this is connected: there are issues unbinding from and binding to the driver, and there are reset issues as well. I've put my test branch here: https://github.com/Lucretia/linux-amdgpu/tree/amdgpu/v4.13-rc2-amdgpu-reset-test

(In reply to Luke A. Guest from comment #27)
> > FWIW I don't think any of these patches are relevant to you then.
>
> Not strictly true. As I said, Alex pointed me at this page to try his
> patches. I believe all of this is connected: there are issues unbinding from
> and binding to the driver, and there are reset issues as well.

True, there may be init/cleanup issues with AMD cards that might be better understood and documented in the open-source community if this were fixed in amdgpu, and hopefully that helps you in vfio-pci as well.
As I will pass on my R9 290 and switch to an RX 580, please let me know if you need any extra information from either card before I no longer have the R9 290.

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug at our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/207