Created attachment 144598 [details]
GPU crash dump

Hi there,

I am running CentOS 7 on an Intel NUC7i3BNH with KVM/QEMU, using GPU virtualization passthrough to a Windows 10 VM. The kernel is 5.1.9, compiled with merged configs from the standard CentOS 3.10.x kernel and the 5.1.9 kernel-ml from elrepo-kernel.

I am getting random bluescreens in Windows (also with the standard CentOS 3.10.x kernels) and kernel messages containing

"gvt: guest page write error"
"gvt: vgpu(1) Invalid FORCE_NONPRIV write"
"gvt: vgpu 1: fail: spt"
"gvt: vgpu 1: fail: shadow page"

and others, as well as

Jun 16 13:38:30 floor13 kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Jun 16 13:38:30 floor13 kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Jun 16 13:38:30 floor13 kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Jun 16 13:38:30 floor13 kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Jun 16 13:38:30 floor13 kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error

which is why I am here :)

I will attach the kernel messages containing gvt stuff as well as the GPU crash dump.

Thanks in advance for your much appreciated work!

Alex
Created attachment 144599 [details] kernel messages
00:02.0 VGA compatible controller: Intel Corporation HD Graphics 620 (rev 02) (prog-if 00 [VGA controller])
    Subsystem: Intel Corporation Device 2068
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0, Cache Line Size: 64 bytes
    Interrupt: pin A routed to IRQ 129
    Region 0: Memory at db000000 (64-bit, non-prefetchable) [size=16M]
    Region 2: Memory at 90000000 (64-bit, prefetchable) [size=256M]
    Region 4: I/O ports at f000 [size=64]
    [virtual] Expansion ROM at 000c0000 [disabled] [size=128K]
    Capabilities: [40] Vendor Specific Information: Len=0c <?>
    Capabilities: [70] Express (v2) Root Complex Integrated Endpoint, MSI 00
        DevCap: MaxPayload 128 bytes, PhantFunc 0
            ExtTag- RBE+
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd- ExtTag- PhantFunc- AuxPwr- NoSnoop-
            MaxPayload 128 bytes, MaxReadReq 128 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        DevCap2: Completion Timeout: Not Supported, TimeoutDis-, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
    Capabilities: [ac] MSI: Enable+ Count=1/1 Maskable- 64bit-
        Address: fee00018 Data: 0000
    Capabilities: [d0] Power Management version 2
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot-,D3cold-)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [100 v1] Process Address Space ID (PASID)
        PASIDCap: Exec- Priv-, Max PASID Width: 14
        PASIDCtl: Enable- Exec- Priv-
    Capabilities: [200 v1] Address Translation Service (ATS)
        ATSCap: Invalidate Queue Depth: 00
        ATSCtl: Enable+, Smallest Translation Unit: 00
    Capabilities: [300 v1] Page Request Interface (PRI)
        PRICtl: Enable- Reset-
        PRISta: RF- UPRGI- Stopped+
        Page Request Capacity: 00008000, Page Request Allocation: 00000000
    Kernel driver in use: i915
    Kernel modules: i915
(In reply to velde666 from comment #0)
> Created attachment 144598 [details]
> GPU crash dump

That looks like a plain old userspace hang in the virt-viewer, so mesa most likely. However, given the involvement of gvt, we should double-check that the GPU state is sane...
Hello,

I have the same error message on my laptop (no VM with GPU passthrough, just a plain laptop :^)

In dmesg, I find these messages:

[ 1781.226890] [drm:intel_pipe_update_end [i915]] *ERROR* Atomic update failure on pipe A (start=107045 end=107046) time 310 us, min 763, max 767, scanline start 755, end 765

and

[ 2490.145236] i915 0000:00:02.0: GPU HANG: ecode 7:1:0xfffffffe, in Xorg [1021], hang on rcs0
[ 2490.145238] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 2490.145239] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 2490.145240] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 2490.145241] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 2490.145242] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 2490.145309] i915 0000:00:02.0: Resetting chip for hang on rcs0
[ 2511.190540] i915 0000:00:02.0: Resetting chip for hang on rcs0
[ 2527.190115] i915 0000:00:02.0: Resetting chip for hang on rcs0
[ 2546.197618] i915 0000:00:02.0: Resetting chip for hang on rcs0

etc.

The /sys/class/drm/card0/error file is attached.

Thanks!
Created attachment 144601 [details] /sys/class/drm/card0/error file
Hi there,

a quick update from me: I updated to kernel 5.1.12 with the same configuration and the problem seems to be gone *knocking_on_wood*

Best regards
Alex
OK, the bug strikes again:

Jun 27 22:41:21 floor13 kernel: DMAR: DRHD: handling fault status reg 3
Jun 27 22:41:21 floor13 kernel: DMAR: [DMA Write] Request device [00:02.0] fault addr 3e6c000 [fault reason 07] Next page table ptr is invalid
Jun 27 22:41:27 floor13 kernel: i915 0000:00:02.0: GPU HANG: ecode 9:1:0xfffffffe, in virt-viewer [27642], hang on rcs0
Jun 27 22:41:27 floor13 kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Jun 27 22:41:27 floor13 kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Jun 27 22:41:27 floor13 kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Jun 27 22:41:27 floor13 kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Jun 27 22:41:27 floor13 kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Jun 27 22:41:27 floor13 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jun 27 22:41:35 floor13 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Jun 27 22:41:43 floor13 kernel: i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
Created attachment 144674 [details] 2nd gpu crash dump
(In reply to velde666 from comment #7)
> ok, bug strikes again:
>
> Jun 27 22:41:21 floor13 kernel: DMAR: DRHD: handling fault status reg 3
> Jun 27 22:41:21 floor13 kernel: DMAR: [DMA Write] Request device [00:02.0]
> fault addr 3e6c000 [fault reason 07] Next page table ptr is invalid
> Jun 27 22:41:27 floor13 kernel: i915 0000:00:02.0: GPU HANG: ecode
> 9:1:0xfffffffe, in virt-viewer [27642], hang on rcs0
> Jun 27 22:41:27 floor13 kernel: [drm] GPU hangs can indicate a bug anywhere
> in the entire gfx stack, including userspace.
> Jun 27 22:41:27 floor13 kernel: [drm] Please file a _new_ bug report on
> bugs.freedesktop.org against DRI -> DRM/Intel
> Jun 27 22:41:27 floor13 kernel: [drm] drm/i915 developers can then reassign
> to the right component if it's not a kernel issue.
> Jun 27 22:41:27 floor13 kernel: [drm] The gpu crash dump is required to
> analyze gpu hangs, so please always attach it.
> Jun 27 22:41:27 floor13 kernel: [drm] GPU crash dump saved to
> /sys/class/drm/card0/error
> Jun 27 22:41:27 floor13 kernel: i915 0000:00:02.0: Resetting rcs0 for hang
> on rcs0
> Jun 27 22:41:35 floor13 kernel: i915 0000:00:02.0: Resetting rcs0 for hang
> on rcs0
> Jun 27 22:41:43 floor13 kernel: i915 0000:00:02.0: Resetting rcs0 for hang
> on rcs0

The crash happened when I connected via mstsc/RDP.
velde666, F. Delente, sorry for not replying here for such a long time. Are you still encountering the problem? With what frequency? Thanks.
Hi Francesco,

no problem, life keeps one busy :)

I am still facing problems with kvmgt and a Windows 10 VM, but I am not quite sure whether this is still the same issue.

In the meantime I first updated to kernel 5.2 and ran into the following issue: https://bugs.freedesktop.org/show_bug.cgi?id=111582

Then I updated to the gvt-staging kernel (5.4.0-rc4-17331-g9074777-dirty) from https://github.com/intel/gvt-linux, which works much better and crashes less often, but still does. Unfortunately the qemu process gets zombified when the VM crashes, and I have to reboot the host (and all other VMs) to get the Windows VM running again :\

I cannot reproduce when or why the crashes occur. I tried to update the Intel GPU drivers in the VM, but everything newer than version 26.20.100.6709 from 11 April 2019 leads to a non-working passed-through GPU in Windows, as the device cannot be started.

I will attach messages from the last crash, which happened while the Windows 10 VM was idle with some programs open that had been started hours or days before and had run without problems.

Best regards
Alex
Created attachment 145935 [details] messages when crash occurred
At Nov 9 23:13:16 there is the following entry

"kernel: INFO: task gvt_service_thr:294 blocked for more than 122 seconds."

followed by some crash dump.
(In reply to velde666 from comment #13)
> At Nov 9 23:13:16 there is following entry
>
> "kernel: INFO: task gvt_service_thr:294 blocked for more than 122 seconds."
>
> followed by some crash dump

Can you attach the crash dump file? How often does a crash occur?
Oh, I thought that the messages are the crash dump. There is no crash dump file even though kdump service is configured and running. I haven't troubleshooted this yet.
Created attachment 145941 [details]
attachment-20508-0.html

Oh and regarding the frequency, I would say that it occurs once or twice a week.
(In reply to velde666 from comment #16)
> Created attachment 145941 [details]
> attachment-20508-0.html
>
> Oh and regarding the frequency, I would say that it occurs once or twice a
> week.

The crash dump file will be available at /sys/class/drm/card0/error when the GPU hangs.
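For anyone who wants to grab that file as soon as a hang is noticed, here is a minimal Python sketch. It assumes the sysfs path mentioned in this thread and treats the text "No error state collected" as the "nothing captured" case. The helper name `save_gpu_error_state` and the output file name pattern are made up for illustration, and reading the file usually needs root.

```python
import shutil
import time
from pathlib import Path

# Path taken from the thread; the marker is the text the file contains
# when the driver has not captured any error state.
ERROR_STATE = Path("/sys/class/drm/card0/error")
EMPTY_MARKER = "No error state collected"


def save_gpu_error_state(src=ERROR_STATE, dest_dir=".", now=None):
    """Copy the i915 error state into dest_dir.

    Returns the path of the copy, or None if there is nothing to save.
    Hypothetical helper: name and file pattern are illustrative only.
    """
    src = Path(src)
    try:
        with src.open("r", errors="replace") as fh:
            head = fh.read(4096)
    except OSError:
        return None  # file missing or not readable
    if not head.strip() or head.lstrip().startswith(EMPTY_MARKER):
        return None  # empty file or no error state captured
    stamp = time.strftime("%Y%m%d-%H%M%S", time.localtime(now))
    dest = Path(dest_dir) / f"i915-error-{stamp}.txt"
    # Read and write explicitly: sysfs files may report a size of zero,
    # so the content has to be streamed rather than copied by size.
    with src.open("rb") as fin, dest.open("wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dest
```

Calling `save_gpu_error_state()` with no arguments reads the default path and writes a timestamped copy into the current directory; it returns `None` when no dump is available.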
OK, I will upload that file the next time the crash occurs.
Created attachment 145943 [details]
attachment-3098-0.html

This evening there was a new issue, after which the mouse pointer was distorted in the Windows VM (nothing else). After a VM shutdown and restart everything was fine again. The host logged the following messages while the issue occurred:

Nov 12 21:01:15 floor13 kernel: DMAR: DRHD: handling fault status reg 2
Nov 12 21:01:15 floor13 kernel: DMAR: [DMA Write] Request device [00:02.0] PASID ffffffff fault addr ffff800409263000 [fault reason 07] Next page table ptr is invalid
Nov 12 21:01:21 floor13 kernel: i915 0000:00:02.0: Resetting rcs0 for preemption time out
Nov 12 21:04:24 floor13 kernel: DMAR: DRHD: handling fault status reg 3
Nov 12 21:04:24 floor13 kernel: DMAR: [DMA Write] Request device [00:02.0] PASID ffffffff fault addr fffffffff0008000 [fault reason 07] Next page table ptr is invalid
Nov 12 21:04:27 floor13 kernel: i915 0000:00:02.0: Resetting rcs0 for preemption time out

The crash dump file is empty.
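To compare the DMAR faults across incidents, the fault lines can be pulled out of the log with a small parser. The Python sketch below is only an illustration: the regular expression matches the two message shapes quoted in this thread (with and without the PASID field) and may not fit other kernel versions, and the name `parse_dmar_fault` is hypothetical.

```python
import re

# Matches only the DMAR fault shapes seen in this thread; other kernels
# may format the message differently.
DMAR_FAULT = re.compile(
    r"DMAR: \[(?P<op>DMA \w+)\] Request device \[(?P<device>[0-9a-f:.]+)\]"
    r"(?: PASID (?P<pasid>[0-9a-f]+))?"
    r" fault addr (?P<addr>[0-9a-f]+)"
    r" \[fault reason (?P<reason>[0-9a-f]+)\] (?P<text>.*)"
)


def parse_dmar_fault(line):
    """Return a dict describing a DMAR fault line, or None if no match."""
    m = DMAR_FAULT.search(line)
    if m is None:
        return None
    return {
        "op": m.group("op"),
        "device": m.group("device"),
        "pasid": m.group("pasid"),          # None when the field is absent
        "addr": int(m.group("addr"), 16),   # fault address as an integer
        "reason": m.group("reason"),        # kept as the printed string
        "text": m.group("text").strip(),
    }
```

Feeding it each line of the kernel log and keeping the non-`None` results gives a list of faults that can be grouped by device or by fault address.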
Hi there,

this morning at 11:05 the Windows VM crashed again, with loads of the following messages:

Nov 17 11:05:04 floor13 kernel: gvt: vgpu 1: fail: shadow page 00000000676a8bff guest entry 0xfb286b90fb286b9 type 9
Nov 17 11:05:04 floor13 kernel: gvt: vgpu 1: fail: spt 00000000bd066075 guest entry 0xfb286b90fb286b9 type 9
Nov 17 11:05:04 floor13 kernel: gvt: vgpu 1: fail: shadow page 00000000bd066075 guest entry 0xfb286b90fb286b9 type 9
Nov 17 11:05:04 floor13 kernel: gvt: guest page write error, gpa 194e47000

ending with

Nov 17 11:05:05 floor13 kernel: gvt: vgpu 1: fail to flush post shadow
Nov 17 11:05:05 floor13 kernel: gvt: vgpu 1: fail to dispatch workload, skip

After that I got several kernel traces starting with

Nov 17 11:07:46 floor13 kernel: INFO: task gvt_service_thr:301 blocked for more than 122 seconds.

repeating every 2 minutes and ending at 11:15:58.

Unfortunately no dump was written to /sys/class/drm/card0/error:

[root@floor13 ~]# cat /sys/class/drm/card0/error
No error state collected

I had to destroy the VM, leaving the corresponding qemu-kvm process zombified. As long as I do not restart the host, I am not able to restart the Windows VM, as the zombie process blocks all connected devices.

Any help or suggestions much appreciated :\

Best regards
Alex
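Since the gvt messages arrive in large bursts, it can help to collapse them into counts per message type before attaching a log. The following Python sketch does that under simple assumptions: it keeps only lines containing "gvt: " and replaces long hexadecimal values (pointers, guest entries, guest physical addresses) with a placeholder so that otherwise identical messages group together. The function name `summarize_gvt` and the `<hex>` placeholder are illustrative choices, not anything the driver provides.

```python
import re
from collections import Counter

# Hex runs of 6+ digits, with or without a 0x prefix, are treated as
# variable data and masked so that similar messages collapse together.
HEX_VALUE = re.compile(r"\b(?:0x)?[0-9a-f]{6,}\b")


def summarize_gvt(lines):
    """Count gvt kernel messages after masking long hex values."""
    counts = Counter()
    for line in lines:
        start = line.find("gvt: ")
        if start == -1:
            continue  # not a gvt message (hung-task traces are skipped too)
        message = line[start:].rstrip()
        counts[HEX_VALUE.sub("<hex>", message)] += 1
    return counts
```

For example, `summarize_gvt(open("/var/log/messages"))` would return a `Counter` whose `most_common()` output shows which failure dominates a given crash; the log path here is just the usual CentOS location and may differ on other systems.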
Created attachment 145982 [details] more kernel messages
Created attachment 145983 [details] kernel traces
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/317.