Created attachment 145476 [details]
the card error state
I use Synergy to control my desktop from my laptop. Simply moving the mouse from the remote screen to the local one triggered this:
Sep 23 22:46:30 mdontu-l kernel: DMAR: DRHD: handling fault status reg 3
Sep 23 22:46:30 mdontu-l kernel: DMAR: [DMA Write] Request device [00:02.0] fault addr 94f99000 [fault reason 05] PTE Write access is not set
Sep 23 22:48:51 mdontu-l kernel: i915 0000:00:02.0: GPU HANG: ecode 7:0:0x00000000, hang on rcs0
Sep 23 22:48:51 mdontu-l kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Sep 23 22:48:51 mdontu-l kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Sep 23 22:48:51 mdontu-l kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Sep 23 22:48:51 mdontu-l kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Sep 23 22:48:51 mdontu-l kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Sep 23 22:48:51 mdontu-l kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0
It has happened three or four times before and only with Linux 5.3.x. Linux 5.2 did not exhibit this behaviour. I use suspend to RAM quite often. Maybe that plays a role. Anyway, it does not bother me. The screen is frozen for a few seconds and then everything is back to normal.
Created attachment 145477 [details]
My dmesg has some IOMMU-related errors, but it has been doing that for the last year or so. No noticeable problem so far.
Created attachment 145478 [details]
Here is the output of lspci -v, just in case.
Created attachment 145479 [details]
Empirical evidence suggests that the context restore hang is related to the IOMMU.
Created attachment 145578 [details]
[drm] GPU crash dump saved to /sys/class/drm/card0/error
I am experiencing similar problems to Mihai above.
My system (Arch Linux with Windows 10 guest, using IOMMU GPU passthrough) has been working for over a year, but started freezing on the Host side a few days ago (after updates).
[ 812.306423] i915 0000:00:02.0: GPU HANG: ecode 7:0:0x00000000, hang on rcs0
[ 812.306424] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 812.306436] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 812.306436] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 812.306436] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 812.306436] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 812.306639] i915 0000:00:02.0: Resetting chip for hang on rcs0
The last line, "Resetting chip for hang on rcs0", is repeated many times and coincides with system hangs of ~3s every 10-15s, especially when the QEMU guest is running and the host system is in use (mouse movement).
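For anyone who wants to put numbers on the reset cadence described above, here is a minimal sketch that pulls the timestamps of the "Resetting chip for hang" lines out of saved dmesg output and reports the gaps between them. It assumes the bracketed seconds-since-boot timestamp format shown in the log above (not the syslog date format), and the function names are made up for illustration.

```python
import re

# Matches dmesg lines such as:
#   [ 812.306639] i915 0000:00:02.0: Resetting chip for hang on rcs0
# Assumption: timestamps are in the bracketed seconds-since-boot format.
RESET_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s.*Resetting chip for hang on \S+")


def reset_times(dmesg_text):
    """Return the timestamp (seconds since boot) of every reset line."""
    times = []
    for line in dmesg_text.splitlines():
        match = RESET_RE.match(line)
        if match:
            times.append(float(match.group(1)))
    return times


def reset_intervals(dmesg_text):
    """Return the gaps, in seconds, between successive reset lines."""
    times = reset_times(dmesg_text)
    return [round(later - earlier, 6) for earlier, later in zip(times, times[1:])]
```

Feeding it the text of a saved log (for example the output of `dmesg` redirected to a file) returns a list of gaps in seconds, which should cluster around 10-15 if the hangs are as regular as they feel.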
Also, internet connectivity is impaired; it usually recovers by itself after a few seconds.
As my system is basically unusable (because of the frequent freezes, which also affect network stability as a side effect), I tried the Arch Linux LTS kernel (4.19.75-1-lts), and the bug does not seem to occur there.
-> No freezes for half an hour with Windows 10 guest running (with the newest Kernel, the freezing starts even without starting the guest system after a few minutes).
I've got the problem that whenever I do disk traffic, after a few minutes the GUI becomes laggy. Soon it locks completely.
Sometimes the GUI simply locks, without any background work done.
I once got the mentioned GPU HANG message but I didn't fetch the error log.
So I am still trying to reproduce this error message. The error log is still empty.
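Since the crash dump is easy to miss, here is a rough sketch for saving it as soon as the message appears. It copies /sys/class/drm/card0/error (the path printed in the kernel log above) to a timestamped file. The placeholder string checked below is my assumption of what the file contains when no hang has been recorded, and the function name is hypothetical. Reading the file usually requires root.

```python
import os
import time

# Path taken from the kernel message "GPU crash dump saved to ...".
ERROR_PATH = "/sys/class/drm/card0/error"
# Assumption: text the file holds when no error state has been captured.
NO_ERROR_TEXT = "No error state collected"


def save_error_state(src=ERROR_PATH, dest_dir="."):
    """Copy the error state to a timestamped file.

    Returns the destination path, or None if there is nothing to save.
    """
    with open(src, "r", errors="replace") as f:
        data = f.read()
    stripped = data.strip()
    if not stripped or stripped == NO_ERROR_TEXT:
        return None
    name = time.strftime("i915-error-%Y%m%d-%H%M%S.txt")
    dest = os.path.join(dest_dir, name)
    with open(dest, "w") as f:
        f.write(data)
    return dest
```

The file is read in full rather than checked by size, because sizes reported for sysfs files are not a reliable indication of their content.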
I know about the problems with Linux and swap, but this is far worse, and it doesn't happen at all on my laptop with a very similar setup.
how to reproduce:
- just read or write a lot of data, say 100 GB, via dd, cp, the GUI, whatever you like
- the Linux cache size increases until the system starts swapping
- SWAPOUT is around 10 to 100 thousand pages (per atop cycle, I believe)
- after a few minutes the GUI is locked
- when the job is finished, the GUI is suddenly OK again, without any of the usual lagging known from a swapping system
system:
- Kubuntu 19.10
- Kernel 5.3.0-19
- Intel i5-9600K
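To cross-check the swap-out figures from the reproduction steps without atop, a small sketch like the following can sample the `pswpout` counter in /proc/vmstat (pages swapped out since boot) and turn two readings into a rate. The function names and the 10-second default interval are only illustrative.

```python
import time


def parse_vmstat(text):
    """Parse /proc/vmstat content (one "name value" pair per line)."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    return stats


def swapout_rate(before, after, seconds):
    """Pages swapped out per second between two parsed snapshots."""
    return (after["pswpout"] - before["pswpout"]) / seconds


def sample_swapout(interval=10.0, path="/proc/vmstat"):
    """Read the counters twice, `interval` seconds apart, and return the rate."""
    with open(path) as f:
        before = parse_vmstat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_vmstat(f.read())
    return swapout_rate(before, after, interval)
```

Running `sample_swapout()` in a loop while the dd or cp job is active shows whether the GUI lock lines up with the bursts of swap-out activity.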
I already tried, without success:
- fresh installation of Kubuntu on a new disk
- new mouse
- vm.swappiness=10 and vm.vfs_cache_pressure=50
- uninstalling xorg-intel driver
BTW, similar to what Alex reported for his LTS kernel, I believe this problem does not occur with kernel 4.15.0.
Another remark: I usually don't have any swap, because of how badly Linux behaves when swapping. I only have swap now because otherwise, in exactly the same circumstances as above, kswapd ran amok instead, making the system even more unstable and unusable.
Alex, could you just try what happens with your GUI when you produce much disk traffic?
High disk usage seems to do nothing bad when using kernel 4.19.81-1-lts.
Using the latest kernel gives me problems regardless of what I do.
Running a VM with GPU passthrough makes things worse (resets and errors), but running a VM also comes with considerable disk I/O, especially when running games.
Is there anything specific you'd need me to do?
I checked again and yes, my problem vanishes if I boot linux-image-4.15.0-1050-oem, which is the only 4.x kernel available in the Ubuntu 19.10 repo.
With kernel 4.x I get the expected behavior of a swapping system: it lags and is hard to use at times, but it's not completely locked for several minutes.
Does this prove that it's NOT Freedesktop/Xorg related?
How long does the GUI have to hang before I should get the above error message in the syslog?
I had a problem with VirtualBox as well.
My host always crashed because kswapd ran amok after a few minutes of running the VM. My solution was to add a small swap space (1 GB). Since then the PC has been suffering from the GUI lock mentioned here.
So, do you have swap? What happens with the VM if you remove all swap space on the host?
Did you ever notice kswapd taking 100% CPU?
What do you mean by GPU passthrough? Is this VMware specific?