110135 – [skl] GPU hang in sway, unknown cause

Status:	RESOLVED MOVED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/i965
Version:	unspecified
Hardware:	Other All

Importance:	medium normal
Assignee:	Intel 3D Bugs Mailing List
QA Contact:	Intel 3D Bugs Mailing List

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2019-03-15 20:39 UTC by Jeff Peeler
Modified:	2019-09-25 20:32 UTC
CC List:	1 user

See Also:
i915 platform:
i915 features:

Attachments
crash dump output (23.55 KB, application/x-bzip) 2019-03-15 20:39 UTC, Jeff Peeler
crash dump output (19.10 KB, application/x-bzip) 2019-03-15 20:44 UTC, Jeff Peeler
i965: make sure to have cs stall before vf cache invalidate (2.42 KB, patch) 2019-03-17 11:57 UTC, Lionel Landwerlin
crash dump 3 (23.26 KB, application/x-bzip) 2019-03-20 15:59 UTC, Jeff Peeler
sway config (11.68 KB, text/plain) 2019-03-22 17:17 UTC, Jeff Peeler
output from drm_info (58.30 KB, text/plain) 2019-04-12 00:46 UTC, Jeff Peeler
sway debug log (41.95 KB, application/gzip) 2019-04-12 00:48 UTC, Jeff Peeler
crash dump output while using gnome (39.21 KB, text/plain) 2019-05-22 17:18 UTC, Jeff Peeler

Description Jeff Peeler 2019-03-15 20:39:47 UTC

Created attachment 143685 [details]
crash dump output

[21763.262354] [drm] GPU HANG: ecode 9:0:0x86cdffff, in sway [26607], reason: hang on rcs0, action: reset
[21763.262356] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[21763.262356] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[21763.262357] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[21763.262357] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[21763.262358] [drm] GPU crash dump saved to /sys/class/drm/card1/error
[21763.263378] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
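
When several hangs need to be compared, the ecode, process name, and PID can be pulled out of the kernel log mechanically. The following is only a sketch: the function name gpu_hang_summary is made up for illustration, and the pattern is modelled on the "GPU HANG" line format quoted above, which may differ between kernel versions.

```shell
# Print "ecode process pid" for every i915 "GPU HANG" line read from stdin.
# gpu_hang_summary is a hypothetical helper name, not part of any tool.
gpu_hang_summary() {
    sed -nE 's/.*GPU HANG: ecode ([^,]+), in ([^ ]+) \[([0-9]+)\].*/\1 \2 \3/p'
}

# Typical use (reading the kernel log may require root on some systems):
#   dmesg | gpu_hang_summary
# The crash dump itself lives at the path named in the log, for example:
#   sudo cat /sys/class/drm/card1/error > gpu-error.txt
```

Lines that do not match the pattern are dropped, so the whole dmesg output can be piped in unfiltered.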

Comment 1 Jeff Peeler 2019-03-15 20:44:28 UTC

Crash from 2 days ago (perhaps seeing two crashes is helpful?)

[15600.080449] [drm] GPU HANG: ecode 9:0:0x85dffffb, in sway [2338], reason: hang on rcs0, action: reset
[15600.080453] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[15600.080454] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[15600.080456] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[15600.080457] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[15600.080460] [drm] GPU crash dump saved to /sys/class/drm/card1/error
[15600.081554] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[18606.079498] perf: interrupt took too long (3136 > 3132), lowering kernel.perf_event_max_sample_rate to 63000
[30631.655904] nouveau 0000:01:00.0: disp: 0x000064a8[0]: INIT_GENERIC_CONDITON: unknown 0x07
[30631.695021] nouveau 0000:01:00.0: disp: 0x000064a8[0]: INIT_GENERIC_CONDITON: unknown 0x07
[40061.910500] perf: interrupt took too long (4001 > 3920), lowering kernel.perf_event_max_sample_rate to 49000

Comment 2 Jeff Peeler 2019-03-15 20:44:46 UTC

Created attachment 143686 [details]
crash dump output

Comment 3 Lionel Landwerlin 2019-03-16 21:10:05 UTC

What kernel & mesa version are you running?

Comment 4 Jeff Peeler 2019-03-17 00:39:42 UTC

Kernel: 4.20.14-200.fc29.x86_64
LibDRM: 2.4.97
Mesa: 18.3.4

Device: Mesa DRI Intel(R) HD Graphics 530 (Skylake GT2)  (0x191b)

Comment 5 Lionel Landwerlin 2019-03-17 11:57:04 UTC

Created attachment 143703 [details] [review]
i965: make sure to have cs stall before vf cache invalidate

Is there any way you could give this patch a try?

How frequent are the hangs?

Thanks a lot!

Comment 6 Jeff Peeler 2019-03-19 01:39:31 UTC

The hangs aren't that frequent, so a week of testing may be necessary to determine if this fixes the issue.

I applied the patch to the RPM sources in Fedora 29 and it applied cleanly to 18.3.4. The Fedora Mesa package is split into a number of different RPM packages. Do you know if I can just update the mesa-dri-drivers package or do I need to update all of them? For reference, the files in that package are:

/usr/lib64/dri/i915_dri.so
/usr/lib64/dri/i965_dri.so
/usr/lib64/dri/kms_swrast_dri.so
/usr/lib64/dri/nouveau_dri.so
/usr/lib64/dri/nouveau_drv_video.so
/usr/lib64/dri/nouveau_vieux_dri.so
/usr/lib64/dri/r200_dri.so
/usr/lib64/dri/r300_dri.so
/usr/lib64/dri/r600_dri.so
/usr/lib64/dri/r600_drv_video.so
/usr/lib64/dri/radeon_dri.so
/usr/lib64/dri/radeonsi_dri.so
/usr/lib64/dri/radeonsi_drv_video.so
/usr/lib64/dri/swrast_dri.so
/usr/lib64/dri/virtio_gpu_dri.so
/usr/lib64/dri/vmwgfx_dri.so
/usr/lib64/gallium-pipe
/usr/lib64/gallium-pipe/pipe_nouveau.so
/usr/lib64/gallium-pipe/pipe_r300.so
/usr/lib64/gallium-pipe/pipe_r600.so
/usr/lib64/gallium-pipe/pipe_radeonsi.so
/usr/lib64/gallium-pipe/pipe_swrast.so
/usr/lib64/gallium-pipe/pipe_vmwgfx.so
/usr/share/drirc.d
/usr/share/drirc.d/00-mesa-defaults.conf

Comment 7 Jeff Peeler 2019-03-20 02:22:52 UTC

I've just upgraded everything. Will report back sometime next week.

Comment 8 Jeff Peeler 2019-03-20 15:59:31 UTC

Created attachment 143741 [details]
crash dump 3

Comment 9 Jeff Peeler 2019-03-20 16:00:12 UTC

Reporting back sooner than I expected; the problem remains:

[56521.372029] [drm] GPU HANG: ecode 9:0:0x85dffffb, in sway [2371], reason: hang on rcs0, action: reset
[56521.373094] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0

Attached error log above.

Comment 10 Denis 2019-03-22 12:44:55 UTC

Hi, what are you usually doing when the hang occurs? Are any apps in use, or are you just navigating the system?

Comment 11 Jeff Peeler 2019-03-22 12:50:00 UTC

Seems to occur mostly when interacting with video. I know it's happened while using Bluejeans video conferencing, but I may have seen it while watching videos in the browser too.

Comment 12 Denis 2019-03-22 16:36:56 UTC

I tried installing sway on my SKL machine, running Fedora 29 (KDE).
I followed this guide:
https://nationpigeon.com/compiling-sway-on-fedora-29/

I can log in to sway, but the only thing I see is a clock (a working clock).

Following the wiki entry at https://github.com/swaywm/sway/wiki#i-just-installed-sway-i-can-move-my-mouse-cursor-but-my-keyboard-does-not-work
I added the default config to .config/sway/config, but it didn't help. As I understand it, I need to set up all the bindings manually. Is that correct? If so, it would be helpful if you could provide your config file so I can reuse it.

Comment 13 Jeff Peeler 2019-03-22 17:17:43 UTC

Created attachment 143758 [details]
sway config

Here's my sway config. You really just need to make sure you have your terminal shortcut set up correctly.

Comment 14 Denis 2019-03-25 13:50:28 UTC

Aha, thanks for the clarification. I sorted out my problem, opened a terminal, and connected to the network.
I have now been running firefox-wayland for about 2 hours (a YouTube video in one tab, a WebGL app (aquarium) in another).
Nothing so far; I'll keep waiting.

Comment 15 Jeff Peeler 2019-03-25 14:04:46 UTC

One thing I just realized is that the crash DID go away. I should not have used the word crash in the latest upload. I assumed the patch would fix the hang too. Is the patch supposed to fix the crash, hang, or both?

Denis, thanks for your reproducing efforts. If there's any additional debug I can do since it sounds like your system is not behaving in the same way let me know.

Comment 16 Lionel Landwerlin 2019-03-25 14:22:45 UTC

(In reply to Jeff Peeler from comment #15)
> One thing I just realized is that the crash DID go away. I should not have
> used the word crash in the latest upload. I assumed the patch would fix the
> hang too. Is the patch supposed to fix the crash, hang, or both?

The patch would only help with the hang.

> 
> Denis, thanks for your reproducing efforts. If there's any additional debug
> I can do since it sounds like your system is not behaving in the same way
> let me know.

Comment 17 Denis 2019-03-26 11:26:13 UTC

(In reply to Jeff Peeler from comment #15)
> One thing I just realized is that the crash DID go away. I should not have
> used the word crash in the latest upload. I assumed the patch would fix the
> hang too. Is the patch supposed to fix the crash, hang, or both?
> 
> Denis, thanks for your reproducing efforts. If there's any additional debug
> I can do since it sounds like your system is not behaving in the same way
> let me know.

I don't know; our configurations look very similar. Did you build sway from git or install it from a repository?
Which browser do you use?

Comment 18 Jeff Peeler 2019-03-26 19:59:20 UTC

I'm using Firefox 65 and built sway from the 1.0 tag. A 1.0 release of sway doesn't exist in Fedora (yet).

Comment 19 Jeff Peeler 2019-04-09 15:32:05 UTC

Given that the patch did seem to help some, is this something that is going to be merged or is there anything else I can do to push this along?

Comment 20 emersion 2019-04-10 07:28:55 UTC

Hi, sway dev here. i915 devs: let me know if you need info about userspace.

Our DRM code behaves more or less like Weston, so I'm not sure what could be wrong here.

However I see that the Intel card is card1 and there are nouveau logs. What is your setup exactly? Is card0 a NVIDIA card? You could also e.g. run https://github.com/ascent12/drm_info on your device to get information about what cards are plugged in and what are their capabilities.

On multi-GPU setups we render on one primary GPU and use DMA-BUFs to copy buffers from the primary GPU to the secondary one (we don't do direct scan-out yet, this really is a copy).

If you could share sway debug logs (sway -d >sway.log 2>&1), that would help figure out the exact setup sway/wlroots runs on.

If you connect all monitors to one card, you could force sway/wlroots to use only this card. This might or might not help, and will hide connectors of the other cards. You can do so by exporting WLR_DRM_DEVICES=/dev/dri/card0 (or card1).
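
To check which kernel driver sits behind each /dev/dri/cardN before choosing a value for WLR_DRM_DEVICES, the sysfs tree can be read directly. This is a minimal sketch, assuming the usual Linux layout in which /sys/class/drm/cardN/device/driver is a symlink to the driver directory. The function name list_drm_cards and its optional root argument are made up for illustration; drm_info reports far more detail.

```shell
# List each DRM card found under a sysfs root together with its kernel driver.
# list_drm_cards is a hypothetical helper; the optional argument (default
# /sys/class/drm) exists only so it can be pointed at another directory.
list_drm_cards() {
    root="${1:-/sys/class/drm}"
    for card in "$root"/card[0-9]*; do
        # Skip connector entries such as card0-eDP-1.
        case "${card##*/}" in *-*) continue ;; esac
        # Skip cards without a bound driver (and an unmatched glob).
        [ -e "$card/device/driver" ] || continue
        driver=$(basename "$(readlink -f "$card/device/driver")")
        printf '%s %s\n' "${card##*/}" "$driver"
    done
}

# Example: show the mapping, then start sway on a single card with debug
# logging, combining the two commands given in the comment above:
#   list_drm_cards
#   WLR_DRM_DEVICES=/dev/dri/card0 sway -d >sway.log 2>&1
```

Each output line has the form "cardN driver", which makes it easy to confirm which card number belongs to i915 and which to nouveau on a hybrid system.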

Comment 21 Jeff Peeler 2019-04-12 00:46:11 UTC

According to drm_info (how else would I have found this information?), card0 is i915 and card1 is nouveau. I set the graphics to hybrid mode instead of discrete as I was told the former was more stable.

I have a somewhat complex setup with a laptop in a dock. The dock connects to two external monitors and I have a third external monitor plugged directly into the laptop (I couldn't get the third screen working otherwise).

I tried the commands you suggested to direct sway to use just one GPU, but with card0 enabled I had only my laptop screen, and with card1 only the 3 external monitors.

Comment 22 Jeff Peeler 2019-04-12 00:46:42 UTC

Created attachment 143942 [details]
output from drm_info

Comment 23 Jeff Peeler 2019-04-12 00:48:48 UTC

Created attachment 143943 [details]
sway debug log

Comment 24 Jeff Peeler 2019-05-22 17:17:38 UTC

After a system upgrade, the bug still persists. The title should probably be changed, as I see the issue in GNOME on Wayland too.

Kernel: 5.0.16-300.fc30.x86_64
LibDRM: 2.4.98
Mesa: 19.0.4

[Wed May 22 11:33:18 2019] [drm] GPU HANG: ecode 9:0:0x84dfdffb, in gnome-shell [5332], reason: hang on rcs0, action: reset
[Wed May 22 11:33:18 2019] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[Wed May 22 11:33:18 2019] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[Wed May 22 11:33:18 2019] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[Wed May 22 11:33:18 2019] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[Wed May 22 11:33:18 2019] [drm] GPU crash dump saved to /sys/class/drm/card0/error

Comment 25 Jeff Peeler 2019-05-22 17:18:31 UTC

Created attachment 144324 [details]
crash dump output while using gnome

Comment 26 GitLab Migration User 2019-09-25 20:32:44 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1801.
