111790 – [hsw iommu] i915 0000:00:02.0: GPU HANG: ecode 7:0:0x00000000, hang on rcs0

Status:	RESOLVED MOVED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	XOrg git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium critical
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	Triaged
Keywords:

Depends on:
Blocks:

Reported:	2019-09-23 20:01 UTC by Mihai Dontu
Modified:	2019-11-29 19:33 UTC
CC List:	5 users

See Also:
i915 platform:	HSW
i915 features:	GPU hang

Attachments
the card error state (24.71 KB, text/plain) 2019-09-23 20:01 UTC, Mihai Dontu
dmesg (543.52 KB, text/plain) 2019-09-23 20:03 UTC, Mihai Dontu
lspci (4.92 KB, text/plain) 2019-09-23 20:04 UTC, Mihai Dontu
lspci -v (6.76 KB, text/plain) 2019-09-23 20:06 UTC, Mihai Dontu
[drm] GPU crash dump saved to /sys/class/drm/card0/error (26.28 KB, text/plain) 2019-09-29 14:35 UTC, Alexander

Description Mihai Dontu 2019-09-23 20:01:55 UTC

Created attachment 145476
the card error state

I use Synergy to control my desktop from my laptop. Simply moving the mouse from the remote screen to the local one triggered this:

Sep 23 22:46:30 mdontu-l kernel: DMAR: DRHD: handling fault status reg 3
Sep 23 22:46:30 mdontu-l kernel: DMAR: [DMA Write] Request device [00:02.0] fault addr 94f99000 [fault reason 05] PTE Write access is not set
Sep 23 22:48:51 mdontu-l kernel: i915 0000:00:02.0: GPU HANG: ecode 7:0:0x00000000, hang on rcs0
Sep 23 22:48:51 mdontu-l kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Sep 23 22:48:51 mdontu-l kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Sep 23 22:48:51 mdontu-l kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Sep 23 22:48:51 mdontu-l kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Sep 23 22:48:51 mdontu-l kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
Sep 23 22:48:51 mdontu-l kernel: i915 0000:00:02.0: Resetting chip for hang on rcs0

It has happened three or four times before and only with Linux 5.3.x. Linux 5.2 did not exhibit this behaviour. I use suspend to RAM quite often. Maybe that plays a role. Anyway, it does not bother me. The screen is frozen for a few seconds and then everything is back to normal.
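For anyone who wants to check their own logs for the same pairing of messages, here is a minimal Python sketch that pulls the DMAR fault and i915 hang lines out of saved kernel log text. The regular expressions are modelled only on the lines quoted above, and the name scan_log is made up for illustration; other kernel versions may word these messages differently.

```python
import re

# Patterns modelled on the log lines quoted in this report (an assumption:
# other kernel versions may format these messages differently).
DMAR_FAULT = re.compile(
    r"DMAR: \[DMA (?P<op>Read|Write)\] Request device \[(?P<dev>[0-9a-f:.]+)\] "
    r"fault addr (?P<addr>[0-9a-f]+) \[fault reason (?P<reason>\d+)\]"
)
GPU_HANG = re.compile(
    r"i915 (?P<dev>[0-9a-f:.]+): GPU HANG: ecode (?P<ecode>\S+), "
    r"hang on (?P<engine>\w+)"
)


def scan_log(text):
    """Return (faults, hangs): lists of dicts with the fields named above."""
    faults, hangs = [], []
    for line in text.splitlines():
        match = DMAR_FAULT.search(line)
        if match:
            faults.append(match.groupdict())
            continue
        match = GPU_HANG.search(line)
        if match:
            hangs.append(match.groupdict())
    return faults, hangs
```

The input can be any saved kernel log text, for example the attached dmesg file read with open(...).read(). The sketch only extracts the fields; it does not show that a fault caused a hang.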

Comment 1 Mihai Dontu 2019-09-23 20:03:15 UTC

Created attachment 145477
dmesg

My dmesg has some IOMMU-related errors, but it has been doing that for the last year or so. No noticeable problem so far.

Comment 2 Mihai Dontu 2019-09-23 20:04:00 UTC

Created attachment 145478
lspci

Here is the output of lspci -v, just in case.

Comment 3 Mihai Dontu 2019-09-23 20:06:54 UTC

Created attachment 145479
lspci -v

Comment 4 Chris Wilson 2019-09-23 20:29:28 UTC

IPEHR: 0x780c0000

Empirical evidence suggests that the context-restore hang is related to the IOMMU.

Comment 5 Alexander 2019-09-29 14:35:08 UTC

Created attachment 145578
[drm] GPU crash dump saved to /sys/class/drm/card0/error

Comment 6 Alexander 2019-09-29 14:40:47 UTC

I am experiencing similar problems to Mihai above.
My system (Arch Linux with a Windows 10 guest, using IOMMU GPU passthrough) has been working for over a year, but the host started freezing a few days ago, after updates.

dmesg: 
[  812.306423] i915 0000:00:02.0: GPU HANG: ecode 7:0:0x00000000, hang on rcs0
[  812.306424] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  812.306436] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  812.306436] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  812.306436] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  812.306436] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  812.306639] i915 0000:00:02.0: Resetting chip for hang on rcs0

The last line, "Resetting chip for hang on rcs0", is repeated many times, and it comes with system hangs of ~3s every 10-15s, especially when the QEMU guest is running and the host system is in use (mouse movement).
Internet connectivity is also impaired; it usually recovers on its own after a few seconds.
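To put a number on how often the resets repeat, a rough Python sketch along these lines could read the dmesg timestamps and report the gaps between consecutive reset messages. It assumes the bracketed "[ seconds.micros ]" prefix shown above; the name reset_intervals is hypothetical, and the pattern accepts both "Resetting chip for hang on" and the "Resetting rcs0 for hang on" wording that also appears in this report.

```python
import re

# dmesg-style line: "[  812.306639] message" (format as quoted in this report).
TS_LINE = re.compile(r"^\[\s*(?P<ts>\d+\.\d+)\]\s+(?P<msg>.*)$")
# Matches "Resetting chip for hang on" and "Resetting rcs0 for hang on".
RESET_MSG = re.compile(r"Resetting \w+ for hang on")


def reset_intervals(dmesg_text):
    """Return the seconds elapsed between consecutive reset messages."""
    stamps = []
    for line in dmesg_text.splitlines():
        match = TS_LINE.match(line.strip())
        if match and RESET_MSG.search(match.group("msg")):
            stamps.append(float(match.group("ts")))
    # Pair each timestamp with the next one and take the difference.
    return [round(later - earlier, 6) for earlier, later in zip(stamps, stamps[1:])]
```

A log with fewer than two reset messages gives an empty list. Lines without the bracketed timestamp (for example journalctl output in its default format) are skipped.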

Comment 7 Alexander 2019-10-02 16:44:49 UTC

As my system is basically unusable (because of the frequent freezes, which also affect internet stability as a side effect), I tried the Arch Linux LTS kernel (4.19.75-1-lts), and the bug does not seem to occur there.
-> No freezes for half an hour with the Windows 10 guest running (with the newest kernel, the freezing starts after a few minutes even without starting the guest system).

Comment 8 JPT 2019-11-04 13:48:33 UTC

Hi,

I have the problem that whenever there is heavy disk traffic, the GUI becomes laggy after a few minutes. Soon it locks up completely.
Sometimes the GUI simply locks up without any background work going on.
I once got the GPU HANG message mentioned above, but I didn't save the error log.
So I am still trying to reproduce this error message. The error log is still empty.

I know about the problems with Linux and swap, but this is far worse, and it doesn't happen at all on my laptop with a very similar setup.

How to reproduce:
- just read or write a lot of data, let's say 100 GB, via dd, cp, the GUI... as you like
- the Linux cache grows until the system starts swapping
- SWAPOUT is around 10 to 100 thousand pages (per atop cycle, I believe)
- after a few minutes the GUI is locked
- when the job is finished, the GUI is suddenly OK again, without any of the usual lag known from a swapping system
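
The swap-out figure in the steps above can also be measured without atop. As a rough sketch (not what atop itself does), the Python below samples the pswpout counter from /proc/vmstat twice and reports pages swapped out per second. The function names and the 10-second default interval are assumptions made for illustration.

```python
import time


def parse_vmstat(text):
    """Parse '/proc/vmstat'-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_out_rate(before, after, seconds):
    """Pages swapped out per second between two parsed snapshots."""
    return (after["pswpout"] - before["pswpout"]) / seconds


def sample_swap_out(interval=10.0, path="/proc/vmstat"):
    """Take two snapshots 'interval' seconds apart and return the rate."""
    with open(path) as f:
        before = parse_vmstat(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = parse_vmstat(f.read())
    return swap_out_rate(before, after, interval)
```

Calling sample_swap_out() in a loop while the large copy is running should show whether the GUI lock coincides with heavy swap-out.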

my environment
- Kubuntu 19.10
- Kernel 5.3.0-19
- Intel i5-9600K

I already tried, without success:
- fresh installation of Kubuntu on a new disk
- new mouse
- vm.swappiness = 10 and vm.vfs_cache_pressure=50
- uninstalling xorg-intel driver

BTW, in line with what Alex wrote, I believe this problem disappears with kernel 4.15.0.
Another remark: I usually don't have any swap, because of how badly Linux behaves when swapping. I only have swap now because, without it, kswapd ran amok in exactly the same circumstances as above, making the system even more unstable and unusable.

Alex, could you try what happens with your GUI when you generate a lot of disk traffic?

Comment 9 Alexander 2019-11-04 20:02:46 UTC

Hello JPT,

High disk usage does not seem to cause any problems when using kernel 4.19.81-1-lts.
The latest kernel gives me problems regardless of what I do.

Running a VM with GPU passthrough makes things worse (resets and errors), but running a VM also comes with considerable disk I/O, especially when running games.

Is there anything specific you'd need me to do?

Best Regards,
Alexander

Comment 10 JPT 2019-11-05 13:18:12 UTC

I checked again and yes, my problem vanishes if I boot linux-image-4.15.0-1050-oem, which is the only 4.x kernel available in the Ubuntu 19.10 repo.

With kernel 4.x I get the expected behavior of a swapping system: it lags and is hard to use at times, but it is not completely locked for several minutes.

Does this prove that it's NOT Freedesktop/xorg related?

How long does the GUI have to hang before I should get the above error message in the syslog?

Hello Alexander

I had a problem with VirtualBox as well.
My host always crashed because kswapd ran amok after a few minutes of running the VM. My solution was to add a small swap space (1 GB). Since then the PC has been suffering from the GUI lock mentioned here.

So, do you have swap? What happens to the VM if you remove all swap space on the host?
Did you ever notice kswapd taking 100% CPU?
What do you mean by GPU passthrough? Is this VMware specific?

kind regards

Jan

Comment 11 SimonWittwer 2019-11-17 10:51:34 UTC

Same issue here

Hangs every 15-20 sec
arch x86_64
kernel 5.3.11-arch1-1
journalctl -xe:
i915 0000:00:02.0: Resetting rcs0 for hang on rcs0

After downgrading to kernel 5.3.10-arch1-1 I have no freezes.

Comment 12 Martin Peres 2019-11-29 19:33:59 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/446.
