Bug 110628 - [gen3] [drm] GPU HANG: ecode 3:1:0x00000000, in Xorg [991], hang on rcs0.
Summary: [gen3] [drm] GPU HANG: ecode 3:1:0x00000000, in Xorg [991], hang on rcs0.
Status: RESOLVED MOVED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: lowest minor
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged, ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2019-05-06 16:05 UTC by Vasily Galkin
Modified: 2019-11-29 19:07 UTC

See Also:
i915 platform: I945G
i915 features: GPU hang


Attachments
/sys/class/drm/card0/error (with dump!), Xorg log and dmesg with drm.debug=0x1e (130.00 KB, application/x-tar)
2019-05-06 16:05 UTC, Vasily Galkin
journalctl priority 7 (1.08 MB, application/zip)
2019-05-31 09:04 UTC, jshand2013

Description Vasily Galkin 2019-05-06 16:05:22 UTC
Created attachment 144176
/sys/class/drm/card0/error (with dump!), Xorg log and dmesg with drm.debug=0x1e

Every few weeks I get a GPU hang reported in the kernel log:
>i915 0000:00:02.0: GPU HANG: ecode 3:1:0x00000000, in Xorg [991], hang on rcs0
>...
>i915 0000:00:02.0: Resetting chip for hang on rcs0
(see attached archive for more messages)
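When scanning a long dmesg or journal for these events, it can help to pull the fields out of the hang line. The following is a minimal sketch whose pattern is modelled only on the single message quoted above; other kernel versions may format the line differently. The function name `parse_gpu_hang` and the field names are hypothetical, and reading the three `ecode` parts as generation, engine mask and error code is an assumption (only the leading 3 matching gen3 is suggested by this report).

```python
import re

# Pattern modelled on the one message quoted in this report (assumption:
# other kernels may print a different layout).
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<engines>[0-9a-fA-F]+):"
    r"(?P<code>0x[0-9a-fA-F]+), "
    r"in (?P<comm>\S+) \[(?P<pid>\d+)\], hang on (?P<engine>\w+)"
)


def parse_gpu_hang(line):
    """Return the fields of a 'GPU HANG' line as a dict, or None if no match."""
    m = HANG_RE.search(line)
    if m is None:
        return None
    return {
        "gen": int(m.group("gen")),              # assumed: hardware generation
        "engines": int(m.group("engines"), 16),  # assumed: engine mask (hex)
        "code": int(m.group("code"), 16),        # assumed: error code
        "comm": m.group("comm"),                 # process name, e.g. Xorg
        "pid": int(m.group("pid")),
        "engine": m.group("engine"),             # e.g. rcs0
    }


if __name__ == "__main__":
    sample = ("i915 0000:00:02.0: GPU HANG: ecode 3:1:0x00000000, "
              "in Xorg [991], hang on rcs0")
    print(parse_gpu_hang(sample))
```

Lines that do not contain a hang message, such as the "Resetting chip" line, return `None`.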


The most recent hang finally produced a non-empty dump in /sys/class/drm/card0/error.
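Since the dump is not always present, a small helper can save it only when something was actually captured. This is a rough sketch, not an official tool: the function name `save_error_state` and the output file name are hypothetical, and the placeholder text "No error state collected" is what I assume the kernel reports when no dump exists. Reading the file normally requires root.

```python
# Assumption: this is the placeholder the i915 driver prints when no
# error state has been captured.
EMPTY_MARKER = b"No error state collected"


def save_error_state(src="/sys/class/drm/card0/error", dst="gpu-error.txt"):
    """Copy the error state to dst if it holds a real dump.

    Returns True when a dump was saved, False when the file was empty
    or contained only the placeholder text.
    """
    with open(src, "rb") as f:
        data = f.read()
    stripped = data.strip()
    if not stripped or stripped.startswith(EMPTY_MARKER):
        return False
    with open(dst, "wb") as out:
        out.write(data)
    return True


if __name__ == "__main__":
    try:
        saved = save_error_state()
    except OSError as exc:
        print(f"could not read error state: {exc}")
    else:
        print("dump saved" if saved else "no dump available")
```

The file is read in one go rather than with a file-copy helper, because sysfs files do not report a meaningful size.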

The hang typically occurs at night (when the office machine is not in use), and in the morning all apps continue working fine. There is no visible effect, so the severity isn't high.

Earlier hang messages were reported in bug #109047, but those hangs were slightly different and the error file lacked the real dump, so I'm closing that bug and opening a new one, now with an actual (base64?) dump in the error file.

System environment:
-- chipset: i945g (82945G/GZ Integrated Graphics Controller 8086:2772)
-- system architecture: 64-bit
-- Linux distribution: Debian buster
-- xf86-video-intel: 2:2.99.917+git20180925-2
-- xserver: 2:1.20.3-1
-- mesa: 18.3.4-2
-- libdrm: 2.4.95-1
-- kernel: 5.1-rc1-based drm-tip 2019y-03m-18d-22h-01m-31s UTC integration manifest
-- Mobo model: DMI: Gigabyte Technology Co., Ltd. GC330UD, BIOS F2 03/17/2009
-- Display connector: d-sub

The exact kernel is the Ubuntu drm-tip binary build from https://kernel.ubuntu.com/~kernel-ppa/mainline/drm-tip/2019-03-19/

Source: https://git.launchpad.net/~ubuntu-kernel-test/ubuntu/+source/linux/+git/mainline-crack/commit/?id=7f60fa0e

drm-tip: 2019y-03m-18d-22h-01m-31s UTC integration manifest
commit 7f60fa0e
Comment 1 Chris Wilson 2019-05-06 17:41:52 UTC
IPEHR holds part of the pixel shader program rather than an instruction, which suggests something went wrong with the parsing. Coherency? Cachelines? Fencing?

The other possibility is that the state setup throws off the parser, e.g. a tiled surface that is not aligned. It is only using 0x00900000 as the X-tiled destination and 0x07000000 as the Y-tiled source (glyph atlas). So it looks fine, unless we've missed a 945g peculiarity (and tiling is indeed peculiar).
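As a quick arithmetic check on the two offsets mentioned above, the largest power of two dividing an offset can be computed with `offset & -offset`. This is only an illustrative sketch with a hypothetical helper name, `natural_alignment`. Whether either offset violates a 945G constraint depends on the surface sizes and the hardware rules, neither of which is stated in this comment.

```python
MIB = 1 << 20


def natural_alignment(offset):
    """Return the largest power of two that divides a positive offset."""
    if offset <= 0:
        raise ValueError("offset must be positive")
    # In two's complement, x & -x isolates the lowest set bit of x.
    return offset & -offset


if __name__ == "__main__":
    surfaces = (
        ("X-tiled destination", 0x00900000),
        ("Y-tiled source (glyph atlas)", 0x07000000),
    )
    for name, offset in surfaces:
        align = natural_alignment(offset)
        print(f"{name}: 0x{offset:08x} is aligned to {align // MIB} MiB")
```

By this measure 0x00900000 is aligned to 1 MiB and 0x07000000 to 16 MiB.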
Comment 2 Lakshmi 2019-05-31 08:49:35 UTC
@Vasily, results from the latest drm-tip (https://cgit.freedesktop.org/drm-tip) would be helpful in this case. Can you please reproduce this issue with the latest drm-tip?
Comment 3 jshand2013 2019-05-31 09:04:51 UTC
Created attachment 144397
journalctl priority 7
Comment 4 Vasily Galkin 2019-05-31 10:01:58 UTC
@jshand2013

According to the log, I'm not sure that your problem is related to this bug: it involves much newer hardware than gen3 (DP is present among the connectors), and the hang is triggered after org_kde_powerdevil tries to do something with power saving.
Comment 5 Lakshmi 2019-06-14 08:21:18 UTC
Vasily, any results from drm-tip testing?
Comment 6 Vasily Galkin 2019-06-14 09:38:08 UTC
I haven't tested anything newer than 5.1-rc yet. I'll report back if I test a newer kernel.

However, it looks harmless: I've seen it ~10 times over the last several months, and it has never even crashed the browser, which uses HW acceleration.

The biggest visible problem I've seen is screen flicker, so I'm lowering the importance level.

Several years ago a similar problem drove me mad by completely hanging the system every 2-3 weeks. Now the situation is MUCH better. Thank you for still supporting gen3!
Comment 7 Martin Peres 2019-11-29 19:07:12 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/291.

