110951 – GPU hang while using kvmgt

Status:	RESOLVED MOVED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	high normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

Whiteboard:	Triaged, ReadyForDev

Reported:	2019-06-20 12:34 UTC by velde666
Modified:	2019-11-29 19:10 UTC
CC List:	2 users

i915 platform:	KBL
i915 features:	GPU hang

Attachments
GPU crash dump (27.12 KB, text/plain) 2019-06-20 12:34 UTC, velde666
kernel messages (175.16 KB, application/gzip) 2019-06-20 12:35 UTC, velde666
/sys/class/drm/card0/error file (45.70 KB, text/plain) 2019-06-20 15:48 UTC, F. Delente
2nd gpu crash dump (26.84 KB, text/plain) 2019-06-27 20:45 UTC, velde666
messages when crash occurred (2.41 MB, text/plain) 2019-11-11 18:42 UTC, velde666
attachment-20508-0.html (1.75 KB, text/html) 2019-11-12 13:18 UTC, velde666
attachment-3098-0.html (1.06 KB, text/html) 2019-11-12 20:17 UTC, velde666
more kernel messages (2.05 MB, text/plain) 2019-11-17 12:38 UTC, velde666
kernel traces (3.25 MB, text/plain) 2019-11-17 12:39 UTC, velde666

Description velde666 2019-06-20 12:34:29 UTC

Created attachment 144598 [details]
GPU crash dump

Hi there,

I am running CentOS 7 on an Intel NUC7i3BNH with KVM/QEMU, using GPU virtualization passthrough to a Windows 10 VM.

The kernel is 5.1.9, compiled with merged configs from the standard CentOS 3.10.x kernel and the 5.1.9 kernel-ml from elrepo-kernel.

I am getting random bluescreens in Windows (also with the standard CentOS 3.10.x kernels), along with kernel messages such as "gvt: guest page write error", "gvt: vgpu(1) Invalid FORCE_NONPRIV write", "gvt: vgpu 1: fail: spt", "gvt: vgpu 1: fail: shadow page", and others, as well as

Jun 16 13:38:30 floor13 kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Jun 16 13:38:30 floor13 kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Jun 16 13:38:30 floor13 kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Jun 16 13:38:30 floor13 kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Jun 16 13:38:30 floor13 kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error

which is why I am here :)

I will attach kernel messages containing gvt stuff as well as the GPU crash dump.

Thanks in advance for your much appreciated work!

Alex

Comment 1 velde666 2019-06-20 12:35:58 UTC

Created attachment 144599 [details]
kernel messages

Comment 2 velde666 2019-06-20 12:38:20 UTC

00:02.0 VGA compatible controller: Intel Corporation HD Graphics 620 (rev 02) (prog-if 00 [VGA controller])
	Subsystem: Intel Corporation Device 2068
	Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
	Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
	Latency: 0, Cache Line Size: 64 bytes
	Interrupt: pin A routed to IRQ 129
	Region 0: Memory at db000000 (64-bit, non-prefetchable) [size=16M]
	Region 2: Memory at 90000000 (64-bit, prefetchable) [size=256M]
	Region 4: I/O ports at f000 [size=64]
	[virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
	Capabilities: [40] Vendor Specific Information: Len=0c <?>
	Capabilities: [70] Express (v2) Root Complex Integrated Endpoint, MSI 00
		DevCap:	MaxPayload 128 bytes, PhantFunc 0
			ExtTag- RBE+
		DevCtl:	Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
			RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
			MaxPayload 128 bytes, MaxReadReq 128 bytes
		DevSta:	CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
		DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
		DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
	Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable- 64bit-
		Address: fee00018  Data: 0000
	Capabilities: [d0] Power Management version 2
		Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
		Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
	Capabilities: [100 v1] Process Address Space ID (PASID)
		PASIDCap: Exec- Priv-, Max PASID Width: 14
		PASIDCtl: Enable- Exec- Priv-
	Capabilities: [200 v1] Address Translation Service (ATS)
		ATSCap:	Invalidate Queue Depth: 00
		ATSCtl:	Enable+, Smallest Translation Unit: 00
	Capabilities: [300 v1] Page Request Interface (PRI)
		PRICtl: Enable- Reset-
		PRISta: RF- UPRGI- Stopped+
		Page Request Capacity: 00008000, Page Request Allocation: 00000000
	Kernel driver in use: i915
	Kernel modules: i915

Comment 3 Chris Wilson 2019-06-20 12:42:07 UTC

(In reply to velde666 from comment #0)
> Created attachment 144598 [details]
> GPU crash dump

That looks like a plain old userspace hang in the virt-viewer, so mesa most likely. However, given the involvement of gvt, we should double check the GPU state is sane...

Comment 4 F. Delente 2019-06-20 15:47:31 UTC

Hello,

I have the same error message on my laptop (no VM with GPU passthrough, just a plain laptop :^)

In dmesg, I find these messages:

[ 1781.226890] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=107045 end=107046) time 310 us, min 763, max 767, scanline start 755, end 765

and

[ 2490.145236] i915 0000:00:02.0: GPU HANG: ecode 7:1:0xfffffffe, in Xorg [1021], hang on rcs0
[ 2490.145238] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 2490.145239] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 2490.145240] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 2490.145241] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 2490.145242] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 2490.145309] i915 0000:00:02.0: Resetting chip for hang on rcs0
[ 2511.190540] i915 0000:00:02.0: Resetting chip for hang on rcs0
[ 2527.190115] i915 0000:00:02.0: Resetting chip for hang on rcs0
[ 2546.197618] i915 0000:00:02.0: Resetting chip for hang on rcs0 etc.

The /sys/class/drm/card0/error file is attached.

Thanks!

Comment 5 F. Delente 2019-06-20 15:48:36 UTC

Created attachment 144601 [details]
/sys/class/drm/card0/error file

Comment 6 velde666 2019-06-26 08:29:36 UTC

Hi there,

A short note from me: I updated to kernel 5.1.12 with the same configuration and the problem seems to be gone *knocking_on_wood*

Best regards
Alex

Comment 7 velde666 2019-06-27 20:44:12 UTC

ok, bug strikes again:

Jun 27 22:41:21 floor13 kernel: DMAR: DRHD: handling fault status reg 3
Jun 27 22:41:21 floor13 kernel: DMAR: [DMA Write] Request device [00:02.0] fault addr 3e6c000 [fault reason 07] Next page table ptr is invalid
Jun 27 22:41:27 floor13 kernel: i915 0000:00:02.0: GPU HANG: ecode 9:1:0xfffffffe, in virt-viewer [27642], hang on rcs0
Jun 27 22:41:27 floor13 kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Jun 27 22:41:27 floor13 kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Jun 27 22:41:27 floor13 kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Jun 27 22:41:27 floor13 kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Jun 27 22:41:27 floor13 kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Jun 27 22:41:27 floor13 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jun 27 22:41:35 floor13 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jun 27 22:41:43 floor13 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
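
When collecting hang reports like the one above from a long kernel log, it can help to extract the affected process and engine automatically. The following is a minimal sketch, assuming log lines shaped like the "GPU HANG" messages quoted in this report; the names `HANG_RE`, `GpuHang`, and `parse_gpu_hang` are hypothetical, and the ecode value is kept as an uninterpreted string.

```python
import re
from typing import NamedTuple, Optional

# Pattern modeled only on the "GPU HANG" lines quoted in this bug report.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[^,\s]+), in (?P<process>.+?) \[(?P<pid>\d+)\], "
    r"hang on (?P<engine>\w+)"
)


class GpuHang(NamedTuple):
    ecode: str    # raw ecode string, e.g. "9:1:0xfffffffe" (not interpreted here)
    process: str  # process named in the message
    pid: int      # PID printed in square brackets
    engine: str   # engine the hang was reported on, e.g. "rcs0"


def parse_gpu_hang(line: str) -> Optional[GpuHang]:
    """Return the fields of a GPU HANG log line, or None if the line does not match."""
    m = HANG_RE.search(line)
    if m is None:
        return None
    return GpuHang(m["ecode"], m["process"], int(m["pid"]), m["engine"])


sample = (
    "Jun 27 22:41:27 floor13 kernel: i915 0000:00:02.0: GPU HANG: ecode "
    "9:1:0xfffffffe, in virt-viewer [27642], hang on rcs0"
)
print(parse_gpu_hang(sample))
```

Lines that do not contain a hang message, such as the "Resetting rcs0" lines, simply yield None, so the function can be applied to every line of a log.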

Comment 8 velde666 2019-06-27 20:45:30 UTC

Created attachment 144674 [details]
2nd gpu crash dump

Comment 9 velde666 2019-06-27 20:47:16 UTC

(In reply to velde666 from comment #7)
> ok, bug strikes again:
> 
> Jun 27 22:41:21 floor13 kernel: DMAR: DRHD: handling fault status reg 3
> Jun 27 22:41:21 floor13 kernel: DMAR: [DMA Write] Request device [00:02.0]
> fault addr 3e6c000 [fault reason 07] Next page table ptr is invalid
> Jun 27 22:41:27 floor13 kernel: i915 0000:00:02.0: GPU HANG: ecode
> 9:1:0xfffffffe, in virt-viewer [27642], hang on rcs0
> Jun 27 22:41:27 floor13 kernel: [drm] GPU hangs can indicate a bug anywhere
> in the entire gfx stack, including userspace.
> Jun 27 22:41:27 floor13 kernel: [drm] Please file a _new_ bug report on
> bugs.freedesktop.org against DRI -> DRM/Intel
> Jun 27 22:41:27 floor13 kernel: [drm] drm/i915 developers can then reassign
> to the right component if it's not a kernel issue.
> Jun 27 22:41:27 floor13 kernel: [drm] The gpu crash dump is required to
> analyze gpu hangs, so please always attach it.
> Jun 27 22:41:27 floor13 kernel: [drm] GPU crash dump saved to
> /sys/class/drm/card0/error
> Jun 27 22:41:27 floor13 kernel: i915 0000:00:02.0: Resetting rcs0 for hang
> on rcs0
> Jun 27 22:41:35 floor13 kernel: i915 0000:00:02.0: Resetting rcs0 for hang
> on rcs0
> Jun 27 22:41:43 floor13 kernel: i915 0000:00:02.0: Resetting rcs0 for hang
> on rcs0

The crash happened when I connected via mstsc/RDP.

Comment 10 Francesco Balestrieri 2019-11-11 07:09:26 UTC

velde666, F. Delente, sorry for not replying here for such a long time. Are you still encountering the problem? With what frequency? Thanks.

Comment 11 velde666 2019-11-11 18:41:44 UTC

Hi Francesco,

no problem, life keeps one busy :)

I am still facing problems with kvmgt and a Windows 10 VM, but I am not quite sure if this is still the same issue.

In the meantime I first updated to kernel 5.2 and faced the following issue: https://bugs.freedesktop.org/show_bug.cgi?id=111582

Then I updated to the gvt-staging kernel (5.4.0-rc4-17331-g9074777-dirty) from https://github.com/intel/gvt-linux, which works much better: it crashes less often, but it still does crash. Unfortunately, the qemu process gets zombified when the VM crashes and I have to reboot the host (and all other VMs) to get the Windows VM running again :\

I cannot reproduce when or why the crashes occur. I tried to update the Intel GPU drivers in the VM, but everything newer than version 26.20.100.6709 from 11th of April 2019 leads to a non-working passed-through GPU in Windows, as it cannot be started.

I will attach messages from the last crash that happened while the Windows 10 VM was idle with some programs open which had been started hours/days before and ran without problems. 

Best regards
Alex

Comment 12 velde666 2019-11-11 18:42:43 UTC

Created attachment 145935 [details]
messages when crash occurred

Comment 13 velde666 2019-11-11 18:45:56 UTC

At Nov  9 23:13:16 there is the following entry:

"kernel: INFO: task gvt_service_thr:294 blocked for more than 122 seconds." 

followed by some crash dump

Comment 14 Lakshmi 2019-11-12 13:08:59 UTC

(In reply to velde666 from comment #13)
> At Nov  9 23:13:16 there is following entry 
> 
> "kernel: INFO: task gvt_service_thr:294 blocked for more than 122 seconds." 
> 
> followed by some crash dump

Can you attach the crash dump file? How often does a crash occur?

Comment 15 velde666 2019-11-12 13:16:04 UTC

Oh, I thought that the messages were the crash dump.

There is no crash dump file even though the kdump service is configured and running. I haven't troubleshot this yet.



Comment 16 velde666 2019-11-12 13:18:16 UTC

Created attachment 145941 [details]
attachment-20508-0.html

Oh and regarding the frequency, I would say that it occurs once or twice a week.



Comment 17 Lakshmi 2019-11-12 13:31:52 UTC

(In reply to velde666 from comment #16)
> Created attachment 145941 [details]
> attachment-20508-0.html
> 
> Oh and regarding the frequency, I would say that it occurs once or twice a
> week.
> 
> 
The crash dump file will be available at /sys/class/drm/card0/error when the GPU hangs.
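
Because a later hang may overwrite or clear the error state, it can be worth copying the file as soon as a hang is noticed. Below is a minimal sketch, assuming the file is readable by the current user (typically root) and that an empty state is reported either as an empty file or as the text "No error state collected", both of which are seen later in this thread. The function name `save_error_state` and the output file naming are hypothetical.

```python
import time
from pathlib import Path
from typing import Optional

# Path named in this report; the card index may differ on other systems.
ERROR_STATE = Path("/sys/class/drm/card0/error")
# Placeholder text shown in this thread when no dump was captured.
EMPTY_MARKER = "No error state collected"


def save_error_state(src: Path = ERROR_STATE,
                     dest_dir: Path = Path(".")) -> Optional[Path]:
    """Copy the error state into dest_dir.

    Returns the path of the copy, or None if the source is unreadable,
    empty, or only contains the "no error state" placeholder.
    """
    try:
        text = src.read_text(errors="replace")
    except OSError:
        return None
    if not text.strip() or text.startswith(EMPTY_MARKER):
        return None
    # Timestamped name (hypothetical convention) so repeated hangs do not overwrite each other.
    dest = dest_dir / f"gpu-error-{time.strftime('%Y%m%d-%H%M%S')}.txt"
    dest.write_text(text)
    return dest
```

Calling `save_error_state()` with no arguments reads the default path and writes the copy into the current directory; a return value of None means there was nothing to attach.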

Comment 18 velde666 2019-11-12 13:35:09 UTC

OK, I will upload that file the next time the crash occurs.

Comment 19 velde666 2019-11-12 20:17:52 UTC

Created attachment 145943 [details]
attachment-3098-0.html

This evening there was a new issue after which the mouse pointer was distorted in the Windows VM (nothing else). After a VM shutdown and restart everything was fine again.

The host had the following messages while the issue occurred:

Nov 12 21:01:15 floor13 kernel: DMAR: DRHD: handling fault status reg 2
Nov 12 21:01:15 floor13 kernel: DMAR: [DMA Write] Request device [00:02.0] PASID ffffffff fault addr ffff800409263000 [fault reason 07] Next page table ptr is invalid
Nov 12 21:01:21 floor13 kernel: i915 0000:00:02.0: Resetting rcs0 for preemption time out
Nov 12 21:04:24 floor13 kernel: DMAR: DRHD: handling fault status reg 3
Nov 12 21:04:24 floor13 kernel: DMAR: [DMA Write] Request device [00:02.0] PASID ffffffff fault addr fffffffff0008000 [fault reason 07] Next page table ptr is invalid
Nov 12 21:04:27 floor13 kernel: i915 0000:00:02.0: Resetting rcs0 for preemption time out

The crash dump file is empty.
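
Since no crash dump was produced here, the DMAR fault lines are the main evidence for this incident. The following is a minimal sketch for pulling the device, fault address, and reason out of such lines, assuming each message sits on a single (unwrapped) line and follows the two shapes seen in this thread, with and without a PASID field. The names `DMAR_FAULT_RE` and `parse_dmar_fault` are hypothetical, and the fault reason is kept as the raw string printed by the kernel.

```python
import re
from typing import Optional

# Pattern modeled only on the DMAR fault lines quoted in this bug report.
DMAR_FAULT_RE = re.compile(
    r"DMAR: \[(?P<op>DMA \w+)\] Request device \[(?P<device>[0-9a-fA-F:.]+)\] "
    r"(?:PASID (?P<pasid>[0-9a-fA-F]+) )?"   # present only in the newer kernel's output
    r"fault addr (?P<addr>[0-9a-fA-F]+) "
    r"\[fault reason (?P<reason>\w+)\] (?P<message>.*)"
)


def parse_dmar_fault(line: str) -> Optional[dict]:
    """Return the fields of a DMAR fault line as a dict, or None if it does not match."""
    m = DMAR_FAULT_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    # The fault address is printed in hex without a 0x prefix.
    fields["addr"] = int(fields["addr"], 16)
    return fields


sample = (
    "Nov 12 21:01:15 floor13 kernel: DMAR: [DMA Write] Request device [00:02.0] "
    "PASID ffffffff fault addr ffff800409263000 [fault reason 07] "
    "Next page table ptr is invalid"
)
print(parse_dmar_fault(sample))
```

For lines without a PASID field, such as the fault reported in comment 7, the `pasid` entry is None.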

Comment 20 velde666 2019-11-17 12:36:11 UTC

Hi there,

This morning at 11:05 the Windows VM crashed again with loads of the following messages:

Nov 17 11:05:04 floor13 kernel: gvt: vgpu 1: fail: shadow page 00000000676a8bff guest entry 0xfb286b90fb286b9 type 9
Nov 17 11:05:04 floor13 kernel: gvt: vgpu 1: fail: spt 00000000bd066075 guest entry 0xfb286b90fb286b9 type 9
Nov 17 11:05:04 floor13 kernel: gvt: vgpu 1: fail: shadow page 00000000bd066075 guest entry 0xfb286b90fb286b9 type 9.
Nov 17 11:05:04 floor13 kernel: gvt: guest page write error, gpa 194e47000

ending with

Nov 17 11:05:05 floor13 kernel: gvt: vgpu 1: fail to flush post shadow
Nov 17 11:05:05 floor13 kernel: gvt: vgpu 1: fail to dispatch workload, skip



After that I got several kernel traces starting with

Nov 17 11:07:46 floor13 kernel: INFO: task gvt_service_thr:301 blocked for more than 122 seconds.

repeating every 2 minutes and ending at 11:15:58.



Unfortunately, no dump was written to /sys/class/drm/card0/error:

[root@floor13 ~]# cat /sys/class/drm/card0/error
No error state collected

I had to destroy the VM, leaving the corresponding qemu-kvm process zombified. As long as I do not restart the host, I am not able to restart the Windows VM, as the zombie process blocks all connected devices.

Any help or suggestions would be much appreciated :\

Best regards
Alex

Comment 21 velde666 2019-11-17 12:38:50 UTC

Created attachment 145982 [details]
more kernel messages

Comment 22 velde666 2019-11-17 12:39:53 UTC

Created attachment 145983 [details]
kernel traces

Comment 23 Martin Peres 2019-11-29 19:10:31 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/317.
