Bug 109047 - [gen3] [drm] GPU HANG: ecode 3:0:7ee06741 in Chrome_InProcGp, hang on rcs0, reset
Summary: [gen3] [drm] GPU HANG: ecode 3:0:7ee06741 in Chrome_InProcGp, hang on rcs0, r...
Status: REOPENED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium minor
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged
Keywords:
Depends on:
Blocks:
 
Reported: 2018-12-12 20:26 UTC by Vasily Galkin
Modified: 2019-04-17 17:54 UTC
CC List: 2 users

See Also:
i915 platform: I945G
i915 features: GPU hang


Attachments
/sys/class/drm/card0/error (11.37 KB, text/plain)
2018-12-12 20:26 UTC, Vasily Galkin
full dmesg with default debug level (61.11 KB, text/plain)
2018-12-12 20:27 UTC, Vasily Galkin
Compressed logs of crash on 4.20 with drm.debug=0x1e (1.09 MB, application/x-xz)
2019-01-14 17:11 UTC, Vasily Galkin
output of journalctl -o short-monotonic | grep -C 30 i915_reset_device (28.77 KB, text/plain)
2019-01-14 17:29 UTC, Vasily Galkin
/sys/class/drm/card0/error on 4.20 (4.46 KB, text/plain)
2019-02-07 12:36 UTC, Vasily Galkin
/sys/class/drm/card0/error on drm-tip 5.1-rc1 (4.48 KB, text/plain)
2019-04-17 17:51 UTC, Vasily Galkin
Xorg.0.log on drm-tip 5.1-rc1 (33.28 KB, text/plain)
2019-04-17 17:52 UTC, Vasily Galkin
Compressed logs of boot & crash with drm.debug=0x1e on drm-tip 5.1-rc1 (3.83 MB, application/octet-stream)
2019-04-17 17:54 UTC, Vasily Galkin

Description Vasily Galkin 2018-12-12 20:26:27 UTC
Created attachment 142794 [details]
/sys/class/drm/card0/error

After a Chrome-based browser with GLES2-based acceleration had been running for several weeks with a lot of tabs open on a machine with a small amount of memory, the screen flickered and a GPU hang appeared in dmesg.

From the user's point of view the problem is minor, since the GPU recovered fine and even the browser is still working. I am reporting it mostly to share the error state, in case a similar problem affects other users with non-recoverable hangs.

GPU HANG: ecode 3:0:0x7ee06741, in Chrome_InProcGp [7079], reason: hang on rcs0, action: reset

System environment:
-- chipset: i945g (82945G/GZ Integrated Graphics Controller 8086:2772)
-- system architecture: 64-bit
-- xf86-video-intel: 2:2.99.917+git20180925-2
-- xserver:2:1.20.3-1
-- mesa:18.2.5-3
-- libdrm:2.4.95-1
-- kernel:4.19.0-rc7-amd64
-- Linux distribution:debian testing
-- Mobo model:DMI: Gigabyte Technology Co., Ltd. GC330UD, BIOS F2 03/17/2009
-- Display connector:d-sub

The browser, Opera 59.0.3154.0, was started with many non-default arguments, including --use-gl=egl, which enables the GLESv2 backend (to make it usably fast on i945):

opera-developer --no-zygote --use-gl=egl --enable-zero-copy --enable-native-gpu-memory-buffers --disable-gpu-sandbox --in-process-gpu --ui-disable-partial-swap --ui-enable-zero-copy --disable-gpu-driver-bug-workarounds --enable-features=CheckerImaging,UseSkiaRenderer,SkiaDeferredDisplayList --enable-media-suspend --enable-background-timer-throttling --enable-prefer-compositing-to-lcd-text --no-sandbox --no-pings --use-skia-deferred-display-list --use-skia-renderer --renderer-process-limit=1 --limit-fps=15

Actually, this bug is on the same physical machine as https://bugs.freedesktop.org/show_bug.cgi?id=92732

The linked bug is from 3 years ago, with an odd mix of 32-bit userspace on a 64-bit kernel.
That mix was crashing every 2-3 weeks.

Now everything is 64-bit and MUCH more stable: this is the first hang in 3 months.
So, despite this bug, i945g with GLES2 is generally still in good shape with the current kernel and Mesa from the user's point of view.
Comment 1 Vasily Galkin 2018-12-12 20:27:49 UTC
Created attachment 142795 [details]
full dmesg with default debug level
Comment 2 Vasily Galkin 2018-12-12 20:41:58 UTC
This bug is similar to the closed bug about gen3 hangs, https://bugs.freedesktop.org/show_bug.cgi?id=90841

However, it doesn't seem to be a duplicate, since that fix was made in 2016 and the current bug was observed on fairly recent software.
Comment 3 Jani Saarinen 2018-12-12 20:49:56 UTC
Hi, 
This might be wishful thinking, but our drm-tip is at 4.20.0-rc5 today. Would you mind trying to
reproduce the error using drm-tip (https://cgit.freedesktop.org/drm-tip) with the kernel parameters drm.debug=0x1e log_buf_len=4M and, if the problem persists, attaching the full dmesg from boot?
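
For readers wondering what drm.debug=0x1e actually turns on, here is a minimal Python sketch that decodes the bitmask. The bit names follow the kernel's documentation of the drm.debug module parameter as I understand it; only the low six bits are listed, and the helper name decode_drm_debug is hypothetical.

```python
# Debug categories selected by individual bits of drm.debug.
# Assumption: names follow the kernel's drm.debug parameter description;
# higher bits exist but are not needed to decode 0x1e.
DRM_DEBUG_BITS = {
    0x01: "core",
    0x02: "driver",
    0x04: "kms",
    0x08: "prime",
    0x10: "atomic",
    0x20: "vbl",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by mask."""
    return [name for bit, name in DRM_DEBUG_BITS.items() if mask & bit]


print(decode_drm_debug(0x1E))  # ['driver', 'kms', 'prime', 'atomic']
```

So 0x1e leaves the very chatty core category off while enabling driver, KMS, PRIME and atomic messages.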
Comment 4 Vasily Galkin 2019-01-14 17:11:45 UTC
Created attachment 143111 [details]
Compressed logs of crash on 4.20 with drm.debug=0x1e

I've installed Ubuntu's build of drm-tip

from https://kernel.ubuntu.com/~kernel-ppa/mainline/drm-tip/2018-12-13/

version 
4.20.0-994-generic #201812122102 SMP Thu Dec 13 02:04:59 UTC 2018 x86_64 GNU/Linux

compiled from cod/tip/drm-tip/2018-12-13 (1f86f1fb70f082ed93450c328e518d8013d23953 - 2018y-12m-13d-01h-20m-07s UTC integration manifest)

And booted with drm.debug=0x1e log_buf_len=4M

After an uptime of 21.5 days, a similar GPU hang reproduced:
[1851925.030801] vgalkin-desktop kernel: [drm:intel_power_well_enable [i915]] enabling always-on
[1851929.100893] vgalkin-desktop kernel: [drm:intel_power_well_disable [i915]] disabling always-on
[1851949.683477] vgalkin-desktop kernel: [drm:intel_power_well_enable [i915]] enabling always-on
[1851953.132869] vgalkin-desktop kernel: [drm:intel_power_well_disable [i915]] disabling always-on
[1851964.572983] vgalkin-desktop kernel: [drm:intel_power_well_enable [i915]] enabling always-on
[1851973.028999] vgalkin-desktop kernel: [drm] GPU HANG: ecode 3:0:0x407bcfc5, in Xorg [910], reason: hang on rcs0, action: reset
[1851973.029007] vgalkin-desktop kernel: [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[1851973.029010] vgalkin-desktop kernel: [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[1851973.029013] vgalkin-desktop kernel: [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[1851973.029015] vgalkin-desktop kernel: [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[1851973.029018] vgalkin-desktop kernel: [drm] GPU crash dump saved to /sys/class/drm/card0/error
[1851973.029262] vgalkin-desktop kernel: [drm:i915_reset_device [i915]] resetting chip
[1851973.029384] vgalkin-desktop kernel: [drm:drm_atomic_state_init [drm]] Allocated atomic state 000000004846a8c9
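
When comparing hangs across kernels it can help to pull the key fields out of the "GPU HANG" line. The following is a rough Python sketch, not part of any i915 tooling: the regular expressions and the parse_hang helper are my own assumptions, written to match the two message shapes that appear in this report (with and without the "reason:" prefix). The ecode is kept as an opaque string.

```python
import re

# Matches both message shapes seen in this report, e.g.
#   "GPU HANG: ecode 3:0:0x407bcfc5, in Xorg [910], reason: hang on rcs0, action: reset"
#   "GPU HANG: ecode 3:1:0x00000000, in Xorg [877], hang on rcs0"
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[0-9a-fx:]+), "
    r"in (?P<process>\S+) \[(?P<pid>\d+)\], "
    r"(?:reason: )?(?P<reason>[^,\n]+)"
)
# Leading monotonic timestamp as printed by `journalctl -o short-monotonic`.
TIMESTAMP_RE = re.compile(r"^\[\s*(?P<seconds>\d+\.\d+)\]")


def parse_hang(line):
    """Return a dict describing the hang, or None if the line is not a hang."""
    m = HANG_RE.search(line)
    if m is None:
        return None
    info = m.groupdict()
    info["pid"] = int(info["pid"])
    ts = TIMESTAMP_RE.match(line)
    # Seconds since boot converted to days; None when no timestamp is present.
    info["uptime_days"] = float(ts.group("seconds")) / 86400 if ts else None
    return info


line = ("[1851973.028999] vgalkin-desktop kernel: [drm] GPU HANG: ecode "
        "3:0:0x407bcfc5, in Xorg [910], reason: hang on rcs0, action: reset")
print(parse_hang(line))
```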

Nobody was using the computer at that moment, so the visual effect at the time of the hang is unknown (I think the monitor was in standby mode when the hang occurred).
When I looked at it several hours later, everything looked and worked fine.

The 21-day log with drm.debug=0x1e is too large to attach in full, so I'm attaching several compressed logs inside a tar.xz:
boot time: hang-on-4.20-boot-logs.txt
several hours before the hang, and the hang itself: hang-on-4.20-hang-logs.txt
error state: hang-on-4.20-drm-card0-error.txt
Comment 5 Vasily Galkin 2019-01-14 17:29:38 UTC
Created attachment 143112 [details]
output of journalctl -o short-monotonic | grep -C 30 i915_reset_device

After the first hang, which generated the error state, the
[drm:i915_reset_device [i915]] resetting chip message has been appearing about once a day; see the attached log. (I haven't rebooted yet.)

Twice it immediately followed the message "[1855512.225483] vgalkin-desktop barriers[24178]: Barrier 2.2.0-Release: [2019-01-11T03:36:07] NOTE: client "LURAT-PC" is dead"

Barrier is keyboard-and-mouse-sharing software, and I think that after printing this message it shows the previously invisible mouse cursor and may wake the monitor from sleep.
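
To check the "about once a day" estimate against a saved log, the gaps between reset messages can be computed from the monotonic timestamps. This is a small Python sketch under the assumption that the input is the output of journalctl -o short-monotonic (possibly filtered through grep -C, so context lines and separators are ignored); the function name reset_intervals_hours is hypothetical.

```python
import re

# A reset line looks like:
#   [1851973.029262] vgalkin-desktop kernel: [drm:i915_reset_device [i915]] resetting chip
RESET_RE = re.compile(r"^\[\s*(\d+\.\d+)\].*i915_reset_device")


def reset_intervals_hours(lines):
    """Return the gaps, in hours, between consecutive i915_reset_device lines."""
    times = [float(m.group(1)) for m in map(RESET_RE.match, lines) if m]
    return [(later - earlier) / 3600 for earlier, later in zip(times, times[1:])]
```

It could be used as reset_intervals_hours(open("resets.log")), where resets.log is a placeholder name for the saved journalctl output.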
Comment 6 Lakshmi 2019-02-07 11:51:37 UTC
(In reply to Vasily Galkin from comment #5)
> Created attachment 143112 [details]
> output of journalctl -o short-monotonic | grep -C 30 i915_reset_device
> 
> After first hang with generating the error state, the
> [drm:i915_reset_device [i915]] resetting chip message is appearing nearly 1
> time a day - see attached log. (I didn't rebooted yet).
> 
> Two times it is immediately following the message "[1855512.225483]
> vgalkin-desktop barriers[24178]: Barrier 2.2.0-Release:
> [2019-01-11T03:36:07] NOTE: client "LURAT-PC" is dead"
> 
> Barrier is keyboard-mouse-switcher-like software and I think that after
> outputting this message it shows previously invisible mouse cursor and maybe
> awaking monitor from sleep.

Can you attach the GPU crash dump, /sys/class/drm/card0/error?
Have you tried the latest drmtip? (https://cgit.freedesktop.org/drm-tip)
Comment 7 Vasily Galkin 2019-02-07 12:36:22 UTC
Created attachment 143322 [details]
/sys/class/drm/card0/error on 4.20

I forgot to mention that /sys/class/drm/card0/error was already included in attachment 143111 [details]: Compressed logs of crash on 4.20 with drm.debug=0x1e

However, I'm now attaching it as a separate file for simplicity (and obsoleting all attachments from earlier reproductions).

About testing on drm-tip:
Note that the problem typically reproduces only after several weeks of uptime, so my testing is always "nearly 1 kernel release late". For example, my last test was with drm-tip/2018-12-13.

Since each test takes so long, it needs some planning about which version to start testing: it may be more useful to wait 1-3 weeks and test a "new release full of changes" than to restart testing with "nearly the same code".

Has drm-tip received any changes since 2018-12-13 that may affect this bug, or is it better to wait for some future pull?
Comment 8 Lakshmi 2019-02-07 13:56:14 UTC
(In reply to Vasily Galkin from comment #7)
> Created attachment 143322 [details]
> /sys/class/drm/card0/error on 4.20
> 
> Forgot to mention that /sys/class/drm/card0/error was already included in
> attachment 143111 [details]: Compressed logs of crash on 4.20 with
> drm.debug=0x1e
> 
> However now attaching it for simplicity as separate file (and obsoleting all
> attachments from non-last reproduction).

Thanks for attaching the error file. There are no clues in the attached error file.
Is the scenario you described in the bug description the only way to reproduce the hang? If so, then the error file is also from the same scenario. Can you attach Xorg.0.log?

> 
> About testing on drm-tip:
> note that the problem typically reproduces only after several weeks of
> uptime - so it is always "nearly 1 kernel release late" - for example my
> last test was from drm-tip/2018-12-13
> 
> Since testing time is so long - it may need some planning of what version
> start to test - it may be more useful to wait 1-2-3 weeks and test some "new
> release full of changes" than restarting testing with "nearly same code".
> 
> Does current drm-tip since 2018-12-13 include any changes that may affect
> this bug or it's better to wait for some future pull?

There are quite a lot of changes going into drmtip regularly, so we always recommend using the latest drmtip; logs from it will help during the investigation.
Comment 9 Vasily Galkin 2019-02-07 14:36:18 UTC
Unfortunately the Xorg log is already lost. I'll attach it if the bug reproduces again.
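
To avoid losing logs after a future hang, the relevant files can be copied aside as soon as the hang is noticed. Below is a minimal Python sketch of that idea. The path /var/log/Xorg.0.log and the destination directory name are assumptions (the Xorg log may live elsewhere on other setups), the snapshot helper is hypothetical, and reading the sysfs error file may require root.

```python
import time
from pathlib import Path

# Assumed locations: the sysfs path is the one named in the kernel message;
# the Xorg log path is a common default and may differ per system.
DEFAULT_SOURCES = ("/sys/class/drm/card0/error", "/var/log/Xorg.0.log")


def snapshot(sources=DEFAULT_SOURCES, dest_dir="hang-snapshots"):
    """Copy each existing source file into a timestamped directory."""
    out = Path(dest_dir) / time.strftime("%Y%m%dT%H%M%S")
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for src in map(Path, sources):
        if not src.is_file():
            continue  # skip files that do not exist on this system
        target = out / src.name
        # Read the whole file explicitly rather than relying on its reported size.
        target.write_bytes(src.read_bytes())
        saved.append(target)
    return saved
```

Calling snapshot() with no arguments returns the list of files it managed to save, which can then be compressed and attached.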

About the scenario: the first time (when the bug was initially reported, with pre-4.20 kernels) I had opened a lot of new tabs in a Chromium-based browser seconds before the problem, and the hang was in a request from the browser process.

The second time (the current 4.20 attachments) nothing was being done at all; the "office" machine was left locked and unused overnight, and the hang was in a request from the Xorg process.
Comment 10 Lakshmi 2019-03-07 12:45:19 UTC
Reporter, any updates from drmtip? I will close this bug if the issue is not seen on drmtip.
Comment 11 Vasily Galkin 2019-03-07 13:32:50 UTC
I haven't tested drm-tip yet, so I'm closing this for now.

Initially I was afraid that I would see the "flicker-or-*freezing* every 2 weeks" problem that I saw in earlier years with a 32-bit kernel, but it turns out that even when it reproduces (I've seen this bug 4 times now) it always just flickers and continues working fine.

I'll reopen and report if the issue reproduces with a fresher drm-tip.
Comment 12 Vasily Galkin 2019-04-17 17:47:26 UTC
The problem reproduced with a 5.1-rc1-based drm-tip:
drm-tip: 2019y-03m-18d-22h-01m-31s UTC integration manifest
commit 7f60fa0e

Source: https://git.launchpad.net/~ubuntu-kernel-test/ubuntu/+source/linux/+git/mainline-crack/commit/?id=7f60fa0e

I used ubuntu drm-tip binaries from https://kernel.ubuntu.com/~kernel-ppa/mainline/drm-tip/2019-03-19/

As with the previous hang, it occurred at night when the machine wasn't in use, but a browser window was left open, so some rendering may have occurred. Despite the GPU error, the system continues working fine.

This time I saved the Xorg.0.log (Xorg is still running), but it doesn't contain anything at the time of the hang.
However, the main error message was a bit different:
> i915 0000:00:02.0: GPU HANG: ecode 3:1:0x00000000, in Xorg [877], hang on rcs0

I'm going to attach it along with fresh logs.
Comment 13 Vasily Galkin 2019-04-17 17:51:43 UTC
Created attachment 144022 [details]
/sys/class/drm/card0/error on drm-tip 5.1-rc1
Comment 14 Vasily Galkin 2019-04-17 17:52:49 UTC
Created attachment 144023 [details]
Xorg.0.log on drm-tip 5.1-rc1
Comment 15 Vasily Galkin 2019-04-17 17:54:12 UTC
Created attachment 144024 [details]
Compressed logs of boot & crash with drm.debug=0x1e on drm-tip 5.1-rc1

