Bug 102376 - iGPU hangs in KVM guest with GPU passthrough
Status: CLOSED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium critical
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-08-23 17:53 UTC by ebrombugs
Modified: 2018-04-25 06:58 UTC

See Also:
i915 platform: SKL
i915 features: GPU hang


Attachments
iGPU crash dump (/sys/class/drm/card0/error) (755.78 KB, text/plain)
2017-08-23 17:53 UTC, ebrombugs
dmesg output after the bug is triggered (35.90 KB, text/plain)
2017-08-23 17:55 UTC, ebrombugs
4.12 guest dmesg (35.27 KB, text/plain)
2017-08-23 20:40 UTC, ebrombugs
I was wrong, just got a GPU hang with only the iGPU passed through on 4.9 (no dGPU/PRIME) (41.70 KB, text/plain)
2017-08-24 13:54 UTC, ebrombugs
crash dump from hang w/o dGPU passed through (759.22 KB, text/plain)
2017-08-24 14:46 UTC, ebrombugs

Description ebrombugs 2017-08-23 17:53:13 UTC
Created attachment 133720
iGPU crash dump (/sys/class/drm/card0/error)

Overview: The iGPU hangs semi-randomly in a KVM virtual machine running Debian 9. An Intel HD 530 is passed through to the guest at 00:02.0, and a GTX 960M (an Optimus device with no outputs of its own) is passed through at 00:04.0 with the appropriate ROM file, which nouveau loads successfully, in the default PRIME setup.

Steps to reproduce: It's mostly random, but executing this command usually does the trick:

1. Run
vblank_mode=0 DRI_PRIME=1 glxgears
in the aforementioned environment.
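
For repeated attempts, the reproduction step can be wrapped in a small script. This is only a minimal sketch: the helper names `prime_env` and `run_repro` and the 60-second limit are assumptions made for illustration, not part of the original report. It sets the same two environment variables as the command above and stops the program after a fixed time.

```python
import os
import subprocess


def prime_env(base=None):
    """Return a copy of the environment with the two variables from the report."""
    env = dict(os.environ if base is None else base)
    env["vblank_mode"] = "0"  # same setting as in the reported command
    env["DRI_PRIME"] = "1"    # render on the offload GPU (the dGPU here)
    return env


def run_repro(command=("glxgears",), seconds=60):
    """Run `command` for at most `seconds` (hypothetical limit).

    Returns the exit code, or None if the program was still running
    and had to be stopped.
    """
    proc = subprocess.Popen(list(command), env=prime_env())
    try:
        return proc.wait(timeout=seconds)
    except subprocess.TimeoutExpired:
        proc.terminate()
        proc.wait()
        return None
```

Calling `run_repro()` inside the guest's X session runs glxgears with the settings above. After a hang, /sys/class/drm/card0/error can be copied out over ssh, since the display may no longer respond.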

Actual results: The iGPU hangs and the screen becomes completely unresponsive, frozen on the image that was displayed at the moment of the hang; only the cursor can sometimes still be moved. The display sometimes recovers, sometimes stays frozen, and once went black and then showed a static, garbled, seemingly random image.

Expected results: The iGPU should continuously stay responsive, and provide an output for the Optimus-enabled dGPU.

Build Dates: 
i915 1.6.0 20160919
nouveau 1.3.1 20120801
Debian 4.9.30-2+deb9u3 (2017-08-06)
X.Org X Server 1.19.2

Additional Information: 
Output of lspci:

00:00.0 Host bridge: Intel Corporation 440FX - 82441FX PMC [Natoma] (rev 02)
00:01.0 ISA bridge: Intel Corporation 82371SB PIIX3 ISA [Natoma/Triton II]
00:01.1 IDE interface: Intel Corporation 82371SB PIIX3 IDE [Natoma/Triton II]
00:01.2 USB controller: Intel Corporation 82371SB PIIX3 USB [Natoma/Triton II] (rev 01)
00:01.3 Bridge: Intel Corporation 82371AB/EB/MB PIIX4 ACPI (rev 03)
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 530 (rev 06)
00:03.0 Ethernet controller: Intel Corporation 82540EM Gigabit Ethernet Controller (rev 03)
00:04.0 3D controller: NVIDIA Corporation GM107M [GeForce GTX 960M] (rev a2)

Output of xrandr --listproviders:

Provider 0: id: 0x75 cap: 0xf, Source Output, Sink Output, Source Offload, Sink Offload crtcs: 3 outputs: 3 associated providers: 0 name:modesetting
Provider 1: id: 0x3f cap: 0x5, Source Output, Source Offload crtcs: 0 outputs: 0 associated providers: 0 name:modesetting
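
The `cap` values in the provider lines above are bit masks. The following is a minimal sketch of decoding them, assuming the RandR 1.4 provider capability bits (0x1 Source Output, 0x2 Sink Output, 0x4 Source Offload, 0x8 Sink Offload); the function name `parse_provider` is made up for this example.

```python
import re

# Capability bits, assumed to follow the RandR 1.4 provider extension.
CAPABILITIES = {
    0x1: "Source Output",
    0x2: "Sink Output",
    0x4: "Source Offload",
    0x8: "Sink Offload",
}

PROVIDER_RE = re.compile(
    r"Provider (?P<index>\d+): id: (?P<id>0x[0-9a-fA-F]+) "
    r"cap: (?P<cap>0x[0-9a-fA-F]+)"
    r".*?crtcs: (?P<crtcs>\d+) outputs: (?P<outputs>\d+)"
    r".*?name:(?P<name>\S+)"
)


def parse_provider(line):
    """Parse one line of `xrandr --listproviders`; return None if it does not match."""
    match = PROVIDER_RE.search(line)
    if match is None:
        return None
    cap = int(match.group("cap"), 16)
    return {
        "index": int(match.group("index")),
        "cap": cap,
        # Decode the mask into the names xrandr prints after it.
        "capabilities": [name for bit, name in CAPABILITIES.items() if cap & bit],
        "crtcs": int(match.group("crtcs")),
        "outputs": int(match.group("outputs")),
        "name": match.group("name"),
    }
```

Applied to the two lines above, provider 0 (cap 0xf) decodes to all four capabilities with 3 outputs, while provider 1 (cap 0x5) decodes to Source Output and Source Offload with 0 outputs. That matches an Optimus dGPU that renders but has no display connectors of its own.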

Output of DRI_PRIME=0 glxinfo | grep "OpenGL":

OpenGL vendor string: Intel Open Source Technology Center
OpenGL renderer string: Mesa DRI Intel(R) HD Graphics 530 (Skylake GT2)
OpenGL core profile version string: 4.5 (Core Profile) Mesa 13.0.6
OpenGL core profile shading language version string: 4.50
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL core profile extensions:
OpenGL version string: 3.0 Mesa 13.0.6
OpenGL shading language version string: 1.30
OpenGL context flags: (none)
OpenGL extensions:
OpenGL ES profile version string: OpenGL ES 3.2 Mesa 13.0.6
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.20
OpenGL ES profile extensions:

Output of DRI_PRIME=1 glxinfo | grep "OpenGL":

OpenGL vendor string: nouveau
OpenGL renderer string: Gallium 0.4 on NV117
OpenGL core profile version string: 4.1 (Core Profile) Mesa 13.0.6
OpenGL core profile shading language version string: 4.10
OpenGL core profile context flags: (none)
OpenGL core profile profile mask: core profile
OpenGL core profile extensions:
OpenGL version string: 3.0 Mesa 13.0.6
OpenGL shading language version string: 1.30
OpenGL context flags: (none)
OpenGL extensions:
OpenGL ES profile version string: OpenGL ES 3.0 Mesa 13.0.6
OpenGL ES profile shading language version string: OpenGL ES GLSL ES 3.00
OpenGL ES profile extensions:
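
The two glxinfo listings above can be compared mechanically to confirm that DRI_PRIME=1 really selects the dGPU. This is a minimal sketch; the helper names `parse_glxinfo` and `differing_fields` are assumptions made for this example.

```python
def parse_glxinfo(text):
    """Map each 'OpenGL ...' key to its value; keys without a value map to ''."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.startswith("OpenGL"):
            fields[key.strip()] = value.strip()
    return fields


def differing_fields(first, second):
    """Return {key: (first_value, second_value)} for every key whose values differ."""
    keys = sorted(set(first) | set(second))
    return {
        key: (first.get(key), second.get(key))
        for key in keys
        if first.get(key) != second.get(key)
    }
```

With the listings above as input, the vendor and renderer strings differ (Intel versus nouveau on NV117), while the compatibility-profile `OpenGL version string` is the same on both.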

Output of dmesg after crash:

http://paste.debian.net/982675
Comment 1 ebrombugs 2017-08-23 17:55:00 UTC
Created attachment 133721
dmesg output after the bug is triggered
Comment 2 Chris Wilson 2017-08-23 18:04:57 UTC
It didn't handle a context-switch interrupt and so the ELSP queue was drained -- the hardware was idle, even though we still thought it was processing work.

Does this still happen on a recent kernel? There's a little more info in new error states that may help to debug this problem.
Comment 3 ebrombugs 2017-08-23 20:10:11 UTC
I updated the linux-image package to 4.12.0. Now I get no output on my monitor at all, instead of the highly unstable output I had with 4.9. I can, however, confirm that the VM still boots: I can still ssh into it from another device.
Comment 4 ebrombugs 2017-08-23 20:40:48 UTC
Created attachment 133727
4.12 guest dmesg
Comment 5 ebrombugs 2017-08-23 23:14:33 UTC
Oh, I just realized something: I don't get any freezing at all without the Optimus (https://devtalk.nvidia.com/default/topic/957981/linux/prime-render-offloading-on-nvidia-optimus/) dGPU passed through. It makes sense that the driver would be idle with nouveau+PRIME enabled.
Comment 6 ebrombugs 2017-08-23 23:16:10 UTC
The 4.12 kernel issue is also probably completely unrelated and should be in a separate bug report; I will file that one too in a couple of hours.
Comment 7 ebrombugs 2017-08-24 13:54:28 UTC
Created attachment 133747
I was wrong, just got a GPU hang with only the iGPU passed through on 4.9 (no dGPU/PRIME)
Comment 8 ebrombugs 2017-08-24 14:46:02 UTC
Created attachment 133749
crash dump from hang w/o dGPU passed through
Comment 9 Jani Saarinen 2018-03-29 07:11:56 UTC
First of all, sorry about the spam.
This is a mass update for our bugs.

Sorry if you find this annoying, but we are trying to understand whether this bug is still valid.
If the bug investigation is still in progress, please ignore this, and I apologize!

If you think this bug is no longer valid, please comment on the bug so that it can be closed.
If you haven't tested with our latest pre-upstream tree (drm-tip), could you do that as well to see whether the issue is still present there? If you cannot reproduce the issue there, please comment on the bug.
Comment 10 Jani Saarinen 2018-04-25 06:58:17 UTC
Closing; please re-open if the issue still exists.

