Bug 110238 - Crashes when using MDEV passthrough on i915

Status:	RESOLVED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/iGVT-g
Version:	XOrg git
Hardware:	Other All

Importance:	medium normal
Assignee:	Terrence Xu
QA Contact:	Terrence Xu

Reported:	2019-03-25 14:53 UTC by Christian Ehrhardt
Modified:	2019-04-10 08:58 UTC
CC List:	1 user

Attachments
GPU Crash dump as triggered by the bug (2.38 MB, text/plain) 2019-03-25 14:56 UTC, Christian Ehrhardt

Description Christian Ehrhardt 2019-03-25 14:53:02 UTC

Hi,
I was using MDEV passthrough with KVMGT.

GVT-g was enabled on the kernel command line via /etc/default/grub:
  i915.enable_gvt=1 intel_iommu=on drm.debug=0
The required modules were loaded via the initramfs:
  $ printf "kvmgt\nvfio-iommu-type1\nvfio-mdev" | sudo tee /etc/initramfs-tools/modules

Update the initramfs and grub, then reboot:
 $ sudo update-initramfs -u
 $ sudo update-grub

Then I created an MDEV instance with a fixed UUID:
 $ cd /sys/bus/pci/devices/0000:00:02.0/mdev_supported_types/i915-GVTg_V4_4
 $ echo 4dd50f26-ec08-11e8-b838-4bc3356865b6 | sudo tee create
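The mdev type directory also exposes an available_instances file, so a slightly more cautious variant of the step above could look like the sketch below. The helper name create_mdev and its two arguments are made up for illustration; the sysfs path and the type name are the ones from the commands above.

```shell
# Sketch: create an mdev instance only if the type still has free instances.
# create_mdev is a hypothetical helper name, not part of any existing tool.
create_mdev() {
    type_dir=$1   # e.g. /sys/bus/pci/devices/0000:00:02.0/mdev_supported_types/i915-GVTg_V4_4
    uuid=$2       # e.g. the output of uuidgen
    avail=$(cat "$type_dir/available_instances") || return 1
    if [ "$avail" -lt 1 ]; then
        echo "no free instances for $type_dir" >&2
        return 1
    fi
    # Writing the UUID to "create" asks the kernel to instantiate the device.
    echo "$uuid" > "$type_dir/create"
}
```

On a real host this has to run as root, because the redirect into the sysfs create file happens in the calling shell.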

Finally, I told libvirt to use it by modifying my guest XML like this:
 <graphics type='spice'>
   <listen type='none'/>
   <gl enable='yes'/>
 </graphics>
 <hostdev mode='subsystem' type='mdev' managed='no' model='vfio-pci'>
   <source>
     <address uuid='4dd50f26-ec08-11e8-b838-4bc3356865b6'/>
   </source>
 </hostdev>


The pass-through worked and the guest seemed happy for a while.
But later on I realized my guest had gotten stuck, and on the host I found this in dmesg:

[  230.274856] DMAR: DRHD: handling fault status reg 3
[  230.274923] DMAR: [DMA Write] Request device [00:02.0] fault addr fff94000 [fault reason 23] Unknown
[  230.274985] DMAR: DRHD: handling fault status reg 2
[  230.275021] DMAR: [DMA Write] Request device [00:02.0] fault addr 30000 [fault reason 23] Unknown
[  230.275080] DMAR: DRHD: handling fault status reg 2
[  230.275117] DMAR: [DMA Write] Request device [00:02.0] fault addr 55000 [fault reason 23] Unknown
[  230.275179] DMAR: DRHD: handling fault status reg 3
[  235.276444] dmar_fault: 5440889 callbacks suppressed
[  235.276445] DMAR: DRHD: handling fault status reg 3
[  235.276484] DMAR: [DMA Write] Request device [00:02.0] fault addr 2fe93c000 [fault reason 23] Unknown
[  235.276518] DMAR: DRHD: handling fault status reg 2
[  235.276539] DMAR: [DMA Write] Request device [00:02.0] fault addr 2fe96e000 [fault reason 23] Unknown
[  235.276571] DMAR: DRHD: handling fault status reg 2
[  235.276592] DMAR: [DMA Write] Request device [00:02.0] fault addr 2fe994000 [fault reason 23] Unknown
[  235.276625] DMAR: DRHD: handling fault status reg 2
[  240.280429] dmar_fault: 6145791 callbacks suppressed
[  240.280431] DMAR: DRHD: handling fault status reg 3
[  240.280463] DMAR: [DMA Write] Request device [00:02.0] fault addr 5e5db8000 [fault reason 23] Unknown
[  240.280511] DMAR: DRHD: handling fault status reg 3
[  240.280554] DMAR: [DMA Write] Request device [00:02.0] fault addr 5e5dec000 [fault reason 23] Unknown
[  240.280623] DMAR: DRHD: handling fault status reg 3
[  240.280662] DMAR: [DMA Write] Request device [00:02.0] fault addr 5e5e34000 [fault reason 23] Unknown
[  240.280733] DMAR: DRHD: handling fault status reg 3
[  245.284441] dmar_fault: 5699149 callbacks suppressed
[  245.284442] DMAR: DRHD: handling fault status reg 2
[  245.284480] DMAR: [DMA Write] Request device [00:02.0] fault addr 8c90fb000 [fault reason 23] Unknown
[  245.284511] DMAR: DRHD: handling fault status reg 2
[  245.284530] DMAR: [DMA Write] Request device [00:02.0] fault addr 8c9128000 [fault reason 23] Unknown
[  245.284560] DMAR: DRHD: handling fault status reg 2
[  245.284579] DMAR: [DMA Write] Request device [00:02.0] fault addr 8c914a000 [fault reason 23] Unknown
[  245.284610] DMAR: DRHD: handling fault status reg 2
[  250.106273] [drm] GPU HANG: ecode 8:0:0xe757fefe, reason: no progress on rcs0, action: reset
[  250.106274] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  250.106275] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  250.106276] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  250.106276] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  250.106277] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  250.106299] i915 0000:00:02.0: Resetting rcs0 for no progress on rcs0
[  251.900704] i915 0000:00:02.0: Resetting chip for no progress on rcs0
[  251.900718] i915 0000:00:02.0: GPU recovery failed
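To get a quick overview of a log like the one above, the DMAR fault lines can be counted per requesting device and access type. This is only a sketch: summarize_dmar_faults is a hypothetical helper name, and it assumes the line format shown in the log above.

```shell
# Sketch: count DMAR fault lines per requesting device and access type.
# summarize_dmar_faults is a hypothetical helper; it reads log text on stdin.
summarize_dmar_faults() {
    # Keep only the fault detail lines, e.g.
    #   DMAR: [DMA Write] Request device [00:02.0] fault addr ... [fault reason 23] ...
    grep -E 'DMAR: \[DMA (Read|Write)\] Request device' |
        # Reduce each line to "<device> <access type>".
        sed -E 's/.*DMAR: \[(DMA (Read|Write))\] Request device \[([^]]+)\].*/\3 \1/' |
        sort | uniq -c
}
```

It could be used as `sudo dmesg | summarize_dmar_faults`. Since the kernel rate-limits these messages (see the "callbacks suppressed" lines), the counts are only a lower bound.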


Unfortunately /sys/class/drm/card0/error is empty, so not a lot to report.
But OTOH it seems reproducible rather easily.

I have beignet installed on the guest and the following sequence seems to trigger the issues:
1. starting the guest
2. run clinfo in the guest (it shows that the i915 device would be available)
3. wait ~60 seconds

At some point in these 60 seconds it will crash.
I don't know yet whether the "clinfo" run is required or just a red herring; I did not run much else.
But since it seems reproducible please just ask what you'd need in addition and I'll try to create the data needed.

HW Info:
CPU: Intel(R) Core(TM) i5-5250U CPU @ 1.60GHz
$ lspci -v -s 00:02.0
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 6000 (rev 09) (prog-if 00 [VGA controller])
        Subsystem: Intel Corporation HD Graphics 6000
        Flags: bus master, fast devsel, latency 0, IRQ 48
        Memory at f6000000 (64-bit, non-prefetchable) [size=16M]
        Memory at e0000000 (64-bit, prefetchable) [size=256M]
        I/O ports at f000 [size=64]
        [virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
        Capabilities: <access denied>
        Kernel driver in use: i915
        Kernel modules: i915


SW Info:
Ubuntu, latest release, with kernel 5.0.0-8-generic.
For the MDEV passthrough: libvirt 5.0 and QEMU 3.1.

The host was initially still running an Ubuntu desktop on the very same graphics card, so some arbitration might as well have been the issue. But I also booted it into text mode only (no UI stack initialized), and that triggered the same bug.


Let me know what you'd need to get this debugged further (e.g. a pointer on how to better enable GPU crash dumps?).

Comment 1 Christian Ehrhardt 2019-03-25 14:55:25 UTC

The GPU crash dump file was only reporting a size of zero; it is actually readable. Attaching it as a file ...
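Files under sysfs often report a size that says nothing about their content, so the safest check is to read the file and look at what comes out. A minimal sketch, assuming the error path named in the dmesg output; save_gpu_error and the default output file name are hypothetical:

```shell
# Sketch: save the i915 error state by reading it, not by trusting its reported size.
# save_gpu_error and the default destination name are hypothetical.
save_gpu_error() {
    src=${1:-/sys/class/drm/card0/error}
    dest=${2:-gpu-crash-dump.txt}
    cat "$src" > "$dest" || return 1
    # Judge emptiness by what was actually read into the copy.
    [ -s "$dest" ]
}
```

The function returns success only if the copy is non-empty, and the copy can then be attached to a report.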

Comment 2 Christian Ehrhardt 2019-03-25 14:56:13 UTC

Created attachment 143775
GPU Crash dump as triggered by the bug

Comment 3 Terrence Xu 2019-04-04 13:21:54 UTC

Can you try "intel_iommu=igfx_off" on the kernel command line in grub? There are currently some DMAR issues in the latest kernel.

Comment 4 Christian Ehrhardt 2019-04-10 08:58:31 UTC

Hi,
thanks for your feedback.

I experimented more with this case and found that sometimes the issue just does not show up, even after quite some time. But there is a way to "force" the issue (which occurs on the host) from the guest.
To do so, install the OpenCL bits in the guest and try to use them:
 $ sudo apt install ocl-icd-libopencl1
 $ sudo apt install opencl-headers
 $ sudo apt install clinfo
 $ sudo apt install beignet
And then run clinfo
 $ clinfo
So far this triggered the bug immediately in all my tests.


With that setup in place, so that I can actively trigger the issue, I tried the suggested intel_iommu=igfx_off.

That works like a charm in my setup - thanks!
With intel_iommu=igfx_off I can even run accelerated OpenCL workloads in the guest on the mdev.
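To confirm after a reboot that the option actually reached the kernel, the command line can be checked. The sketch below uses a hypothetical helper has_igfx_off and also accepts the comma-separated form (for example intel_iommu=on,igfx_off), on the assumption that the options are combined that way:

```shell
# Sketch: report whether a kernel command line carries intel_iommu=...igfx_off.
# has_igfx_off is a hypothetical helper; without an argument it reads /proc/cmdline.
has_igfx_off() {
    cmdline=${1-$(cat /proc/cmdline)}
    for word in $cmdline; do
        case $word in
            intel_iommu=*)
                # The value may be a comma-separated list of options.
                case ",${word#intel_iommu=}," in
                    *,igfx_off,*) return 0 ;;
                esac
                ;;
        esac
    done
    return 1
}
```

For example, `has_igfx_off && echo "igfx_off is active"` checks the running kernel.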

But since [1] says about this option, "If this fixes anything, please ensure you file a bug reporting the problem.", I think I might need to file a bug in a different place, maybe against the IOMMU code itself.

Thanks - we can close this bug and I'll look for a place to file that IOMMU bug.

[1]: https://www.kernel.org/doc/Documentation/Intel-IOMMU.txt