Bug 107475 - [iGVT-g][SKL] GPU Hang and iGVT-g guest crash under certain loads
Summary: [iGVT-g][SKL] GPU Hang and iGVT-g guest crash under certain loads
Status: NEEDINFO
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/iGVT-g
Version: unspecified
Hardware: x86-64 (AMD64) Windows (All)
Importance: high major
Assignee: Terrence Xu
QA Contact: Terrence Xu
URL:
Whiteboard: Triaged, ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2018-08-03 16:51 UTC by leozinho29_eu
Modified: 2018-12-04 13:26 UTC
CC List: 4 users

See Also:
i915 platform: SKL
i915 features: GPU hang


Attachments
/sys/class/drm/card0/error (150.04 KB, text/plain), 2018-08-03 16:51 UTC, leozinho29_eu
dmesg and card0/error (281.17 KB, application/x-xz), 2018-09-12 16:43 UTC, leozinho29_eu
dmesg (422.34 KB, text/plain), 2018-10-18 04:54 UTC, leozinho29_eu
dmesg after Mageia 6 guest crashes system (566.72 KB, application/x-xz), 2018-10-24 20:26 UTC, leozinho29_eu
dmesg and video showing what happened (706.27 KB, application/gzip), 2018-10-24 21:38 UTC, leozinho29_eu

Description leozinho29_eu 2018-08-03 16:51:51 UTC
Created attachment 140955 [details]
/sys/class/drm/card0/error

When running a Windows 10 guest with Intel GVT-g and dma-buf, many graphical workloads stutter, some applications crash, and some consistently crash the guest with a blue screen while causing a GPU hang on the host.

To reproduce the problem consistently, a Windows 10 1803 guest with dma-buf using the Intel HD Graphics driver version 24.20.100.6194 is required. The QEMU command line used to start the guest is:

env PULSE_LATENCY_MSEC=10 QEMU_AUDIO_ADC_VOICES=0 QEMU_AUDIO_DRV=pa \
nice -n -15 \
qemu-system-x86_64 -name "Windows 10" -k pt-br -nodefaults \
-mem-prealloc -mem-path /dev/hugepages/libvirt/qemu \
-hda redm.qcow2 \
-hdb redm-D.qcow2 \
-enable-kvm -cpu host -smp cores=2,threads=2 -m 4G \
-device usb-tablet,id=tablet -device usb-host,vendorid=0x1b3f,id=soundcardusb \
-vga none -monitor vc -serial stdio -display gtk,gl=on  \
-device vfio-pci,sysfsdev=/sys/bus/pci/devices/0000:00:02.0/123f09b0-4c00-11e8-a6ca-f3c21e47e012,rombar=0,x-igd-opregion=on,display=on,addr=0x3,id=iHD520 \
-cdrom "mídia.iso" \
-machine kernel_irqchip=on -global PIIX4_PM.disable_s3=1 -global PIIX4_PM.disable_s4=1 -M pc,usb=true \
-netdev bridge,id=hostnet0,br=virbr0 -device e1000,netdev=hostnet0,id=net0,mac=aa:bb:cc:dd:ee:11,addr=0x8
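
The vfio-pci argument is the GVT-g specific part of the command line above. A minimal Python sketch, assuming the same parent device and vGPU UUID as in this report, of how that argument could be assembled and checked before starting QEMU. The helper name `vfio_device_arg` is hypothetical; the option values are copied from the command above.

```python
import os

def vfio_device_arg(parent_bdf, mdev_uuid, dev_id="iHD520", addr="0x3"):
    """Build the vfio-pci device argument used in the command line above.

    parent_bdf and mdev_uuid come from this report; other systems will
    have different values.
    """
    sysfsdev = f"/sys/bus/pci/devices/{parent_bdf}/{mdev_uuid}"
    options = [
        "vfio-pci",
        f"sysfsdev={sysfsdev}",
        "rombar=0",
        "x-igd-opregion=on",
        "display=on",
        f"addr={addr}",
        f"id={dev_id}",
    ]
    return sysfsdev, ",".join(options)

sysfsdev, arg = vfio_device_arg(
    "0000:00:02.0", "123f09b0-4c00-11e8-a6ca-f3c21e47e012"
)
# The vGPU has to exist in sysfs before QEMU is started.
print("vGPU present:", os.path.isdir(sysfsdev))
print("-device", arg)
```

On a machine without that vGPU the first line prints False, which is a quick way to catch a stale UUID before the guest is launched.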

One application that consistently causes the blue screen and GPU hang is a game that can be downloaded at: https://www.vector.co.jp/download/file/win95/game/fh310532.html Even though it is a very light workload, it consistently crashes the guest, particularly in the second stage.

The guest stutters noticeably, and the stuttering gets worse and worse until the guest crashes with a blue screen (not visible due to the lack of VGA modes) and the host suffers a GPU hang with the following in dmesg:

[ 1748.473459] [drm] GPU HANG: ecode 9:0:0xfacfffff, reason: Hang on rcs0, action: reset
[ 1748.473461] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 1748.473462] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 1748.473462] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 1748.473463] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 1748.473464] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 1748.473484] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[ 1748.998085] gvt: vgpu 1: untracked MMIO 0000207c len 4
[ 1749.003878] gvt: vgpu 1: untracked MMIO 0000207c len 4
[ 1749.011220] gvt: vgpu 1: untracked MMIO 0000207c len 4
[ 1749.019816] gvt: vgpu 1: untracked MMIO 0000207c len 4

Dozens of untracked MMIO messages with the same address and length follow, and then the same message appears with different addresses.
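
To see which registers dominate such a flood, the messages can be tallied per offset. This is only a sketch in Python, assuming the message format shown in the excerpt above; the function name `count_untracked_mmio` is made up for illustration and the sample text is copied from this report.

```python
import re
from collections import Counter

# Matches lines such as: gvt: vgpu 1: untracked MMIO 0000207c len 4
UNTRACKED = re.compile(
    r"gvt: vgpu (\d+): untracked MMIO ([0-9a-fA-F]+) len (\d+)"
)

def count_untracked_mmio(dmesg_text):
    """Count untracked MMIO messages per (vgpu id, offset, length)."""
    counts = Counter()
    for match in UNTRACKED.finditer(dmesg_text):
        vgpu, offset, length = match.groups()
        counts[(int(vgpu), int(offset, 16), int(length))] += 1
    return counts

sample = """\
[ 1748.473484] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[ 1748.998085] gvt: vgpu 1: untracked MMIO 0000207c len 4
[ 1749.003878] gvt: vgpu 1: untracked MMIO 0000207c len 4
[ 1749.011220] gvt: vgpu 1: untracked MMIO 0000207c len 4
[ 1749.019816] gvt: vgpu 1: untracked MMIO 0000207c len 4
"""

for (vgpu, offset, length), n in count_untracked_mmio(sample).most_common():
    print(f"vgpu {vgpu}: offset 0x{offset:05x} len {length}: {n} message(s)")
```

In practice the input would be the full dmesg attachment rather than the short sample.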

Those issues weren't observed with the 15.45 drivers (the certified ones), but those drivers are unusable on Windows 10 because it automatically updates them to a non-functional version.

The Windows 10 guest is version 1803 and uses Intel driver version 24.20.100.6194. The previous driver version, 24.20.100.6136, does not have those issues, so I think 24.20.100.6194 has a regression that makes it unusable on iGVT-g guests.

System specifications:

Processor: Intel Core i3-6100U;
Video: Intel HD Graphics 520;
Architecture: amd64;
Mesa: 18.2.0-devel (git-f310e86a42);
Kernel version: 4.17.11-lowlatency;
Distribution: Xubuntu 18.04.1 amd64;
QEMU version: 2.12.91 (v3.0.0-rc3-dirty).
Comment 1 Lakshmi 2018-09-11 07:19:14 UTC
Please try to reproduce the error using drm-tip (https://cgit.freedesktop.org/drm-tip) and kernel parameters drm.debug=0x1e log_buf_len=4M, and if the problem persists attach the full dmesg from boot.
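
Before collecting the log, it can help to confirm that the requested parameters are actually in effect. The following Python sketch parses a kernel command line and lists which DRM debug categories a mask such as 0x1e selects. The bit names reflect the drm.debug parameter description as I understand it and should be checked against the kernel in use; the sample command line and the helper names are hypothetical.

```python
# Assumed meaning of the low drm.debug bits (verify against your kernel).
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
}

def parse_cmdline(cmdline):
    """Split a kernel command line into a {parameter: value} dict."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params

def drm_debug_categories(value):
    """Return the names of the debug categories enabled by the mask."""
    mask = int(value, 0)
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]

# Hypothetical example; on a real host read /proc/cmdline instead.
cmdline = "BOOT_IMAGE=/vmlinuz root=/dev/sda1 ro drm.debug=0x1e log_buf_len=4M"
params = parse_cmdline(cmdline)
print("drm.debug:", params.get("drm.debug"),
      drm_debug_categories(params.get("drm.debug", "0")))
print("log_buf_len:", params.get("log_buf_len"))
```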
Comment 2 leozinho29_eu 2018-09-11 14:27:25 UTC
Would the kernel from

http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-tip/2018-09-11/

be okay to use to test this? It would save a good deal of time compared with compiling it.
Comment 3 leozinho29_eu 2018-09-12 16:43:01 UTC
Created attachment 141537 [details]
dmesg and card0/error

I built the kernel (it's doable using make localmodconfig). dmesg behaved peculiarly: messages stopped while the guest window was focused, and the error messages only appeared again when updating the driver in the VM. Maybe the debug flag did not work as expected.

The guest had driver version 24.20.100.6136 and I upgraded it to 24.20.100.6286 (the latest). During the installation, many dmesg messages appeared at 1614 seconds and the screen froze for 15 seconds, until the message at 1629 seconds, when it blinked and came back.

After reboot it seemed better than when I reported this bug but, once the example workload was run, a GPU hang happened. Unlike before, it did not cause a blue screen in the guest, but the application in use crashed. It's not good, but it is less bad.

What dmesg does not show is that there were many small freezes in the guest (framerate was 60 but then it fell to 7 and returned later, for example).
Comment 4 leozinho29_eu 2018-10-18 04:54:01 UTC
Created attachment 142077 [details]
dmesg

Using in the guest: 

-Intel driver version 25.20.100.6326;
-Windows 10 1803 x86_64;

In the host:

-Xubuntu 18.04.1 x86_64;
-Kernel 4.17.19 patched to fix https://bugs.freedesktop.org/show_bug.cgi?id=107899 ;
-QEMU 3.0.0;
-Mesa 18.3.0-devel (git-58a51d0a67);

The guest seems stable. The workloads that were causing the issues I reported here no longer cause problems. It seems as stable as when using driver 24.20.100.6136.

The only new things I noticed were two QEMU errors in the terminal:

qemu-system-x86_64: vfio_region_write(123f09b0-4c00-11e8-a6ca-f3c21e47e012:region0+0x24ec, 0x83a8,4) failed: Endereço inválido
qemu-system-x86_64: vfio_region_write(123f09b0-4c00-11e8-a6ca-f3c21e47e012:region0+0x24ec, 0x83a8,4) failed: Endereço inválido

"Endereço inválido" means "Invalid address". The first one appeared when upgrading the drivers and the second on reboot. dmesg is red with messages such as:

[14200.651340] gvt: guest page write error, gpa 11c643e90

which was generated when upgrading the driver. And:

[14411.555755] gvt: vgpu 1: untracked MMIO 00004084 len 4

which appears in multiple circumstances, such as resolution changes. Kernels 4.18 and newer don't have these messages, but https://bugs.freedesktop.org/show_bug.cgi?id=107945 is still present with 4.19-rc8 and drm-tip.

The Windows 10 guest using iGVT-g is running very well; nearly everything I tried worked on it as it did on a Windows 10 host, only with a bit of overhead.
Comment 5 leozinho29_eu 2018-10-22 23:59:57 UTC
The GPU hangs are no longer present, but QEMU 3.0.0 still prints this error:

qemu-system-x86_64: vfio_region_write(123f09b0-4c00-11e8-a6ca-f3c21e47e012:region0+0x24ec, 0x83a8,4) failed: Endereço inválido

While it appears, it's not possible to ensure that the guest and host systems will be stable. At the same time QEMU 3.0.0 prints the error, dmesg prints related messages:

[  600.375287] gvt: vgpu(1) Invalid FORCE_NONPRIV write 83a8
[  600.375292] gvt: vgpu 1: fail to emulate MMIO write 000024ec len 4
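
The link between the QEMU error and the dmesg messages is the register offset (0x24ec) and the write length (4). A rough Python sketch of checking that correspondence automatically, assuming the two message formats quoted in this comment; the regular expressions and function names are illustrative, not part of QEMU or the kernel.

```python
import re

# QEMU side: vfio_region_write(<uuid>:region0+0x24ec, 0x83a8,4) failed: ...
QEMU_ERR = re.compile(
    r"vfio_region_write\((?P<dev>[^:]+):region(?P<region>\d+)"
    r"\+(?P<offset>0x[0-9a-fA-F]+), (?P<value>0x[0-9a-fA-F]+),"
    r"(?P<size>\d+)\) failed"
)
# Kernel side: gvt: vgpu 1: fail to emulate MMIO write 000024ec len 4
GVT_ERR = re.compile(
    r"gvt: vgpu (?P<vgpu>\d+): fail to emulate MMIO write "
    r"(?P<offset>[0-9a-fA-F]+) len (?P<size>\d+)"
)

def failed_writes(qemu_log):
    """Map (offset, size) to the value QEMU failed to write."""
    return {
        (int(m["offset"], 16), int(m["size"])): int(m["value"], 16)
        for m in QEMU_ERR.finditer(qemu_log)
    }

def rejected_writes(dmesg_text):
    """Collect (offset, size) pairs the GVT-g host refused to emulate."""
    return {
        (int(m["offset"], 16), int(m["size"]))
        for m in GVT_ERR.finditer(dmesg_text)
    }

qemu_log = (
    "qemu-system-x86_64: vfio_region_write(123f09b0-4c00-11e8-a6ca-"
    "f3c21e47e012:region0+0x24ec, 0x83a8,4) failed: Endereço inválido\n"
)
dmesg_text = (
    "[  600.375287] gvt: vgpu(1) Invalid FORCE_NONPRIV write 83a8\n"
    "[  600.375292] gvt: vgpu 1: fail to emulate MMIO write 000024ec len 4\n"
)

for key, value in failed_writes(qemu_log).items():
    offset, size = key
    matched = key in rejected_writes(dmesg_text)
    print(f"offset 0x{offset:x} value 0x{value:x} len {size}: "
          f"matching dmesg entry: {matched}")
```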

Under heavy guest load, the guest and/or host can crash. QEMU from the KVMGT release (https://lists.freedesktop.org/archives/intel-gfx/2018-October/179043.html) is silent and prints nothing about the error.

Sometimes the guest crashes with a blue screen and the host is OK; sometimes the guest and the host crash together.

When the host crashes, dmesg shows a message in the style of a kernel WARNING or kernel BUG, but the computer freezes and the message is not saved across the reboot. The following messages were the last ones saved before the host crash:

Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188019] gvt: guest page write error, gpa 28f4c000
Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188032] gvt: guest page write error, gpa 28f4c010
Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188043] gvt: guest page write error, gpa 28f4c020
Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188052] gvt: guest page write error, gpa 28f4c030
Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188062] gvt: guest page write error, gpa 28f4c040
Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188072] gvt: guest page write error, gpa 28f4c050
Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188081] gvt: guest page write error, gpa 28f4c060
Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188091] gvt: guest page write error, gpa 28f4c070
Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188101] gvt: guest page write error, gpa 28f4c080
Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188110] gvt: guest page write error, gpa 28f4c090
Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188120] gvt: guest page write error, gpa 28f4c0a0
Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188131] gvt: guest page write error, gpa 28f4c0b0
Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188143] gvt: guest page write error, gpa 28f4c0c0
Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188156] gvt: guest page write error, gpa 28f4c0d0
Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188297] gvt: guest page write error, gpa 28f4c0e0
Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188311] gvt: guest page write error, gpa 28f4c0f0
Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188323] gvt: guest page write error, gpa 28f4c100
Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188333] gvt: guest page write error, gpa 28f4c110
Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188343] gvt: guest page write error, gpa 28f4c120
Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188352] gvt: guest page write error, gpa 28f4c130
Oct 22 20:04:32 Lenovo-ideapad-310-14ISK kernel: [ 4102.188362] gvt: guest page write error, gpa 28f4c140
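
The addresses in the excerpt above advance in regular steps, which is easier to see once the messages are collapsed into runs. A small Python sketch of that summary, assuming the message format shown above; `summarize_gpa_runs` is a hypothetical helper, and the sample rebuilds only the 21 addresses from the excerpt (timestamps omitted).

```python
import re

GPA = re.compile(r"gvt: guest page write error, gpa ([0-9a-fA-F]+)")

def summarize_gpa_runs(log_text):
    """Group consecutive errors into runs with a constant address stride.

    Each run is [first_gpa, last_gpa, stride, count].
    """
    runs = []
    for match in GPA.finditer(log_text):
        gpa = int(match.group(1), 16)
        if runs:
            first, last, stride, count = runs[-1]
            step = gpa - last
            # The second address of a run fixes its stride.
            if count == 1 or step == stride:
                runs[-1] = [first, gpa, step, count + 1]
                continue
        runs.append([gpa, gpa, 0, 1])
    return runs

# Addresses 28f4c000 .. 28f4c140 in steps of 0x10, as in the excerpt.
sample = "\n".join(
    f"gvt: guest page write error, gpa {0x28f4c000 + i * 0x10:x}"
    for i in range(21)
)

for first, last, stride, count in summarize_gpa_runs(sample):
    print(f"0x{first:x}..0x{last:x} stride 0x{stride:x} count {count}")
```

For the excerpt this yields a single run of 21 writes with a 16-byte stride, all inside the 4 KiB guest page starting at 0x28f4c000.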

I tested with QEMU from the KVMGT release and QEMU 3.0.0, and with kernel 4.17.19, the kernel from the KVMGT release, and drm-tip. Any combination of them causes the problems when the guest is under heavy load.

The only option for now is to use the Intel Windows driver version 24.20.100.6136, as it has no error related to:

[  600.375292] gvt: vgpu 1: fail to emulate MMIO write 000024ec len 4
Comment 6 Lakshmi 2018-10-23 06:14:17 UTC
Can you attach the full dmesg from boot with kernel parameters drm.debug=0x1e log_buf_len=4M?
Comment 7 Lakshmi 2018-10-23 08:30:34 UTC
I assume this is a gvt bug. Changing the component.
Comment 8 leozinho29_eu 2018-10-24 20:26:11 UTC
Created attachment 142178 [details]
dmesg after Mageia 6 guest crashes system

This dmesg is from a system crash caused when booting Mageia 6 Live ISO when using drm-tip in the host. While it's not a Windows guest, it crashed in a similar way as the Windows guest crashed: video and input didn't work and sound worked correctly until the track that was being played ended, then sound stopped too.

I'm not 100% sure this crash is related to problems with Windows guests. I set the log buffer length to 64 MB now, hopefully I'll be able to get an entire dmesg if/when the Windows 10 guest crashes.
Comment 9 leozinho29_eu 2018-10-24 21:38:49 UTC
Created attachment 142180 [details]
dmesg and video showing what happened

This is what happened with Windows 10 guest when upgrading the drivers. Apparently the filesystem access was lost so the logs weren't saved, but dmesg still managed to print messages.
Comment 10 Terrence Xu 2018-12-03 14:22:15 UTC
(In reply to leozinho29_eu from comment #9)
> Created attachment 142180 [details]
> dmesg and video showing what happened
> 
> This is what happened with Windows 10 guest when upgrading the drivers.
> Apparently the filesystem access was lost so the logs weren't saved, but
> dmesg still managed to print messages.

I suggest switching the guest GFX driver to our validated version: https://downloadcenter.intel.com/download/28240/Legacy-Intel-Graphics-Driver-for-Windows-10 (25.20.100.6326).
Comment 11 leozinho29_eu 2018-12-04 13:18:34 UTC
With the gvt-stable-4.17 kernel the system is noticeably more stable than with other kernel versions, such as 4.18.0-12 (Ubuntu Bionic Beaver HWE Edge) or drm-tip. The Intel Windows driver version 25.20.100.6326 seems to be less stable than 24.20.100.6136, even when using gvt-stable-4.17.

The most stable Intel Windows driver version to me is 24.20.100.6136 and the most stable Linux kernel version is gvt-stable-4.17.

I noticed that the other, less stable kernel versions print the following message when booting the Windows 10 guest with 25.20.100.6326:

[2612.248481] gvt: vgpu(1) Invalid FORCE_NONPRIV write 83a8 at offset 24ec

The system can't be guaranteed to be stable when the above message appears. Under heavy load, mainly heavy memory usage, hundreds of messages like:

[3803.580306] gvt: guest page write error, gpa 690814b0

can appear in dmesg. With drm-tip, a kernel panic is likely to happen, while with 4.18.0-12 and gvt-stable-4.17 the system may continue working normally or the guest may blue-screen; it is more likely to continue working when using 24.20.100.6136.

From my experience, I would say 25.20.100.6326 is worse than 24.20.100.6136 stability-wise, and the gvt-stable-4.17 kernel is better than the others.
Comment 12 Terrence Xu 2018-12-04 13:26:12 UTC
(In reply to leozinho29_eu from comment #11)

> [2612.248481] gvt: vgpu(1) Invalid FORCE_NONPRIV write 83a8 at offset 24ec
The following patch is available to fix the problem; you can give it a try:
drm/i915/gvt: update force-to-nonpriv register whitelist
commit b1440caf6cecfffa310c60fc585e50e76bf6ddd5

> [3803.580306] gvt: guest page write error, gpa 690814b0
This error is logged without any bad behavior; we are tracking it now in our internal Bugzilla.

