106111 – [GPU Passthrough]GPU (Polaris) not reinitialized with Linux VM (Reset bug)

Status:	RESOLVED MOVED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/AMDgpu
Version:	unspecified
Hardware:	x86-64 (AMD64) All

Importance:	medium normal
Assignee:	Default DRI bug account

Reported:	2018-04-17 22:29 UTC by Maxime
Modified:	2019-11-19 08:35 UTC
CC List:	4 users

Attachments
xorg.conf (152 bytes, text/plain) 2018-04-17 22:29 UTC, Maxime
dmesg output after to launch the VM a second time (213.25 KB, text/plain) 2018-04-17 22:30 UTC, Maxime
dmesg after second launch + 4.17-rc1 (77.89 KB, text/plain) 2018-04-18 06:11 UTC, Maxime

Description Maxime 2018-04-17 22:29:27 UTC

Created attachment 138887 [details]
xorg.conf

Hi,

My setup:
- AMD Ryzen 1600
- 16 GB RAM
- Host (Debian Stable, kernel 4.16.2): AMD RX 560, 4 GB
- Guest (Windows 10 / Arch Linux, kernel 4.15.x-4.16.x): AMD RX 580, 8 GB

Years ago there was an issue with Windows virtual machines using QEMU/VFIO and an AMD GPU. It was impossible to reboot the guest or use it a second time, because the GPU was not reinitialized when the guest was shut down. The only solution to reuse the VM was to reboot the host or use an Nvidia GPU.

Currently, the issue is fixed for Windows VMs with a passed-through AMD GPU (I don't know how); I can start my VM several times without rebooting the host.

But if I use my Linux VM with my RX 580, the issue still exists. The first launch works, and I can use the RX 580 to play games without problems. But if I shut down or reboot the guest, the RX 580 is "blocked". I need to hard reboot the host because the system hangs after ~2-3 minutes.

Thanks for your help,
Maxime 

(Sorry for my English, I'm French)

Comment 1 Maxime 2018-04-17 22:30:13 UTC

Created attachment 138888 [details]
dmesg output after to launch the VM a second time

Comment 2 Alex Williamson 2018-04-17 22:55:23 UTC

The IOMMU looks to be unhappy first:

[   40.201258] vfio_ecap_init: 0000:0a:00.0 hiding ecap 0x19@0x270
[   40.201271] vfio_ecap_init: 0000:0a:00.0 hiding ecap 0x1b@0x2d0
[   40.201279] vfio_ecap_init: 0000:0a:00.0 hiding ecap 0x1e@0x370
[  159.958402] AMD-Vi: Completion-Wait loop timed out
[  160.118777] AMD-Vi: Completion-Wait loop timed out
[  160.799864] AMD-Vi: Event logged [
[  160.799868] IOTLB_INV_TIMEOUT device=0a:00.0 address=0x000000043e8e8550]
[  160.799872] AMD-Vi: Event logged [
[  160.799874] IOTLB_INV_TIMEOUT device=0a:00.0 address=0x000000043e8e8570]
[  160.799876] AMD-Vi: Event logged [
[  160.799878] IOTLB_INV_TIMEOUT device=0a:00.0 address=0x000000043e8e8590]
[  161.801729] AMD-Vi: Event logged [
[  161.801732] IOTLB_INV_TIMEOUT device=0a:00.0 address=0x000000043e8e85e0]
[  180.096365] AMD-Vi: Completion-Wait loop timed out
[  180.256758] AMD-Vi: Completion-Wait loop timed out
[  180.417182] AMD-Vi: Completion-Wait loop timed out
[  180.577636] AMD-Vi: Completion-Wait loop timed out

Can you try a v4.17-rc1 kernel?  Specifically, these two updates:

6bd06f5a486c vfio/type1: Adopt fast IOTLB flush interface when unmap IOVAs
eb5ecd1a40e2 iommu/amd: Add support for fast IOTLB flushing

Something about AMD GPUs gets unhappy if the IOMMU sends out too many invalidations, and the above two patches can reduce the number of those invalidations by up to a factor of 512.

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=6bd06f5a486c06023a618a86e8153b91d26f75f4
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=eb5ecd1a40e2098f805fb63cb07817ac48826e40
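
For anyone comparing logs before and after these patches, the two AMD-Vi error types quoted above can be counted mechanically. This is only a minimal sketch: the function name `count_amd_vi_errors` and the regular expressions are assumptions based solely on the message formats shown in this report, and other kernel versions may word the messages differently.

```python
import re
from collections import Counter

# Patterns derived from the log lines quoted in this report (assumption:
# other kernels may format these messages differently).
IOTLB_TIMEOUT = re.compile(r"IOTLB_INV_TIMEOUT device=([0-9a-fA-F:.]+)")
COMPLETION_WAIT = re.compile(r"AMD-Vi: Completion-Wait loop timed out")


def count_amd_vi_errors(dmesg_text):
    """Return (IOTLB timeout count per device, Completion-Wait timeout count)."""
    per_device = Counter()
    completion_waits = 0
    for line in dmesg_text.splitlines():
        match = IOTLB_TIMEOUT.search(line)
        if match:
            per_device[match.group(1)] += 1
        elif COMPLETION_WAIT.search(line):
            completion_waits += 1
    return per_device, completion_waits


# A few lines copied from the excerpt above, used as sample input.
SAMPLE = """\
[  159.958402] AMD-Vi: Completion-Wait loop timed out
[  160.118777] AMD-Vi: Completion-Wait loop timed out
[  160.799864] AMD-Vi: Event logged [
[  160.799868] IOTLB_INV_TIMEOUT device=0a:00.0 address=0x000000043e8e8550]
"""

devices, waits = count_amd_vi_errors(SAMPLE)
print(dict(devices), waits)
```

Running the same function over the dmesg output of a patched and an unpatched host gives a rough way to see whether the number of timeouts changes.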

Comment 3 Maxime 2018-04-18 06:11:21 UTC

Created attachment 138893 [details]
dmesg after second launch + 4.17-rc1

Same problem with kernel 4.17-rc1. To be sure: I only need to install this kernel on the host, not on the Linux guest?

I built my own 4.17 kernel, so maybe some IOMMU/VFIO options are missing:

odelpasso@debian-desktop:~/Bureau$ cat /boot/config-4.17.0-rc1 | grep VFIO
CONFIG_VFIO_IOMMU_TYPE1=m
CONFIG_VFIO_VIRQFD=m
CONFIG_VFIO=m
# CONFIG_VFIO_NOIOMMU is not set
CONFIG_VFIO_PCI=m
CONFIG_VFIO_PCI_VGA=y
CONFIG_VFIO_PCI_MMAP=y
CONFIG_VFIO_PCI_INTX=y
CONFIG_VFIO_PCI_IGD=y
# CONFIG_VFIO_MDEV is not set
CONFIG_KVM_VFIO=y

odelpasso@debian-desktop:~/Bureau$ cat /boot/config-4.17.0-rc1 | grep IOMMU
# CONFIG_GART_IOMMU is not set
# CONFIG_CALGARY_IOMMU is not set
CONFIG_IOMMU_HELPER=y
CONFIG_VFIO_IOMMU_TYPE1=m
# CONFIG_VFIO_NOIOMMU is not set
CONFIG_IOMMU_API=y
CONFIG_IOMMU_SUPPORT=y
# Generic IOMMU Pagetable Support
CONFIG_IOMMU_IOVA=y
CONFIG_AMD_IOMMU=y
CONFIG_AMD_IOMMU_V2=y
# CONFIG_INTEL_IOMMU is not set
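
A check like the one above can also be scripted. The following is a minimal sketch that parses kernel config text and reports which options from a given list are not enabled. The helper names `parse_kernel_config` and `missing_options` are hypothetical, and the `REQUIRED` list is only an illustrative assumption, not an authoritative list of what VFIO passthrough needs.

```python
import re

CONFIG_LINE = re.compile(r"^(CONFIG_\w+)=(.*)$")
NOT_SET_LINE = re.compile(r"^# (CONFIG_\w+) is not set$")

# Illustrative assumption: options taken from the grep output in this comment.
REQUIRED = [
    "CONFIG_VFIO",
    "CONFIG_VFIO_PCI",
    "CONFIG_VFIO_IOMMU_TYPE1",
    "CONFIG_AMD_IOMMU",
]


def parse_kernel_config(text):
    """Map each CONFIG_* option to its value; 'is not set' lines become 'n'."""
    options = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        match = CONFIG_LINE.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = NOT_SET_LINE.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def missing_options(options, required=REQUIRED):
    """Return the required options that are absent or explicitly disabled."""
    return [name for name in required if options.get(name, "n") == "n"]


# Sample input copied from the grep output above.
SAMPLE = """\
CONFIG_VFIO_IOMMU_TYPE1=m
CONFIG_VFIO=m
# CONFIG_VFIO_NOIOMMU is not set
CONFIG_VFIO_PCI=m
CONFIG_AMD_IOMMU=y
"""

print(missing_options(parse_kernel_config(SAMPLE)))
```

To check a real host, the text of a file such as /boot/config-4.17.0-rc1 can be read and passed to `parse_kernel_config` in place of `SAMPLE`.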

Comment 4 Alex Williamson 2018-04-18 16:13:52 UTC

There is a difference, now we have:

[   84.997634] vfio_ecap_init: 0000:0a:00.0 hiding ecap 0x19@0x270
[   84.997645] vfio_ecap_init: 0000:0a:00.0 hiding ecap 0x1b@0x2d0
[   84.997653] vfio_ecap_init: 0000:0a:00.0 hiding ecap 0x1e@0x370
[  145.518307] vfio_ecap_init: 0000:0a:00.0 hiding ecap 0x19@0x270
[  145.518313] vfio_ecap_init: 0000:0a:00.0 hiding ecap 0x1b@0x2d0
[  145.518318] vfio_ecap_init: 0000:0a:00.0 hiding ecap 0x1e@0x370

So prior to time 145.5 the VM was shut down and started again, and we could still read the config space of the device.  Previously we were already getting IOMMU faults before the second startup.  But shortly after:

[  193.328586] AMD-Vi: Completion-Wait loop timed out
[  193.488711] AMD-Vi: Completion-Wait loop timed out
[  194.169913] iommu ivhd0: AMD-Vi: Event logged [
[  194.169921] iommu ivhd0: IOTLB_INV_TIMEOUT device=0a:00.0 address=0x000000043e8aaca0]
[  194.169924] iommu ivhd0: AMD-Vi: Event logged [
[  194.169928] iommu ivhd0: IOTLB_INV_TIMEOUT device=0a:00.0 address=0x000000043e8aacc0]

And the "stuck in D3" state is evidence that the device is no longer accessible on the bus.  So the patches only delayed the issue; some interaction between the IOMMU and GPU is still failing.
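
The timeline reasoning above can be sketched in code. This is a rough illustration only: it assumes, following the reading in this comment, that each burst of `vfio_ecap_init` lines corresponds to one VM start, and the function names and the one-second grouping gap are hypothetical choices.

```python
import re

# Matches a dmesg line of the form "[  145.518307] message".
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]\s*(.*)$")
IOMMU_ERRORS = ("Completion-Wait loop timed out", "IOTLB_INV_TIMEOUT")


def vm_start_times(dmesg_text, gap_seconds=1.0):
    """Return the first timestamp of each burst of vfio_ecap_init lines."""
    starts = []
    last = None
    for line in dmesg_text.splitlines():
        match = TIMESTAMP.match(line.strip())
        if not match or "vfio_ecap_init" not in match.group(2):
            continue
        stamp = float(match.group(1))
        # Lines more than gap_seconds apart are treated as separate VM starts.
        if last is None or stamp - last > gap_seconds:
            starts.append(stamp)
        last = stamp
    return starts


def first_iommu_error_time(dmesg_text):
    """Return the timestamp of the first AMD-Vi timeout line, or None."""
    for line in dmesg_text.splitlines():
        match = TIMESTAMP.match(line.strip())
        if match and any(err in match.group(2) for err in IOMMU_ERRORS):
            return float(match.group(1))
    return None


# Sample lines copied from the 4.17-rc1 log quoted above.
SAMPLE = """\
[   84.997634] vfio_ecap_init: 0000:0a:00.0 hiding ecap 0x19@0x270
[   84.997645] vfio_ecap_init: 0000:0a:00.0 hiding ecap 0x1b@0x2d0
[  145.518307] vfio_ecap_init: 0000:0a:00.0 hiding ecap 0x19@0x270
[  145.518313] vfio_ecap_init: 0000:0a:00.0 hiding ecap 0x1b@0x2d0
[  193.328586] AMD-Vi: Completion-Wait loop timed out
"""

starts = vm_start_times(SAMPLE)
error_at = first_iommu_error_time(SAMPLE)
print(starts, error_at)
```

If the first error timestamp is later than the second start, the faults began only after the VM was relaunched, which matches the 4.17-rc1 log; in the earlier 4.16 log no second start appears before the first fault.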

Comment 5 Maxime 2018-04-18 16:33:36 UTC

(In reply to Alex Williamson from comment #4)

Thanks for the explanation, Alex.
Can something be done about this, by AMD or by the VFIO maintainers?

Comment 6 Radosław Szkodziński 2018-08-18 08:31:32 UTC

This is still happening. It seems that these GPUs need engine resets before a bus reset, similar to what was done for Fury and Polaris, but more extensive.

A temporary workaround (yeah sure) is to eject the driver: rmmod in a Linux guest, or eject in Windows. This resets the engines.

Windows did the resets on shutdown until driver version 18.5.1, where they broke the shutdown sequence again; read the release notes for the Radeon Pro Vega FE drivers, where they actually slightly care.

Comment 7 Andrew Sheldon 2018-09-14 10:30:30 UTC

Another workaround that has worked for me with a Vega 56 is to suspend-to-ram the host system before trying to start the guest again.
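
The suspend-to-ram workaround can be triggered from a script on the host. The sketch below is an assumption-laden illustration, not a tested fix: it writes `mem` to `/sys/power/state`, which requires root and suspends the machine immediately, and which sleep variant `mem` maps to depends on `/sys/power/mem_sleep`. The function names are hypothetical, and the path is a parameter so the logic can be tried against an ordinary file first.

```python
from pathlib import Path

POWER_STATE = Path("/sys/power/state")


def supports_suspend_to_ram(state_file=POWER_STATE):
    """Return True if 'mem' is listed among the supported sleep states."""
    return "mem" in state_file.read_text().split()


def suspend_to_ram(state_file=POWER_STATE):
    """Request suspend by writing 'mem' to the state file (needs root)."""
    if not supports_suspend_to_ram(state_file):
        raise RuntimeError("suspend-to-ram ('mem') is not offered by this system")
    # On a real host this call returns only after the system resumes.
    state_file.write_text("mem")
```

Based on the comment above, the intended sequence would be: shut down the guest, call `suspend_to_ram()` on the host, wake the host, then start the guest again.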

Comment 8 Martin Peres 2019-11-19 08:35:24 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/amd/issues/346.
