106342 – [drm] HANG: ecode 9:0:0x9cba0f27, in kscreenlocker_g [103585], reason: Hang on rcs0, action: reset

Status:	RESOLVED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	x86-64 (AMD64) All

Importance:	high major
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	Triaged, ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2018-05-02 02:36 UTC by Thiago Macieira
Modified:	2019-02-02 06:00 UTC
CC List:	3 users

See Also:
i915 platform:	KBL, SKL
i915 features:	firmware/dmc, GPU hang, power/suspend-resume

Attachments
card0_error 2018-05-02 (69.29 KB, text/plain) 2018-05-02 02:36 UTC, Thiago Macieira
card0_error 2018-05-11 (84.41 KB, text/plain) 2018-05-11 17:22 UTC, Thiago Macieira
card0_error 2018-05-17 (81.99 KB, text/plain) 2018-05-18 00:45 UTC, Thiago Macieira
card0_error 2018-05-29 (53.81 KB, text/plain) 2018-05-30 02:10 UTC, Thiago Macieira
card0_error 2018-06-08 (46.61 KB, text/plain) 2018-06-09 01:46 UTC, Thiago Macieira
card0_error 2018-06-22 (81.07 KB, text/plain) 2018-06-22 23:26 UTC, Thiago Macieira
card0_error 2018-07-11 (69.26 KB, text/plain) 2018-07-12 02:38 UTC, Thiago Macieira
card0_error 2018-07-20 (80.62 KB, text/plain) 2018-07-20 23:56 UTC, Thiago Macieira
card0_error 2018-07-30 (81.46 KB, application/x-trash) 2018-08-03 00:13 UTC, Thiago Macieira
card0_error 2018-08-02 (32.38 KB, text/plain) 2018-08-03 00:14 UTC, Thiago Macieira
card0_error 2018-08-09 (82.51 KB, text/plain) 2018-08-09 16:26 UTC, Thiago Macieira
card0_error 2018-08-30 (84.52 KB, text/plain) 2018-08-30 15:57 UTC, Thiago Macieira
card0_error 2018-08-30 (second of the same day) (73.49 KB, text/plain) 2018-08-31 00:28 UTC, Thiago Macieira
card0_error 2018-09-18 (49.62 KB, application/x-trash) 2018-09-29 00:29 UTC, Thiago Macieira
card0_error 2018-09-21 (80.57 KB, text/plain) 2018-09-29 00:29 UTC, Thiago Macieira
card0_error 2018-09-28 (83.28 KB, text/plain) 2018-09-29 00:30 UTC, Thiago Macieira
card0_error 2018-10-03 (78.96 KB, text/plain) 2018-10-03 23:46 UTC, Thiago Macieira
card0_error 2018-10-09 (70.26 KB, text/plain) 2018-10-10 01:41 UTC, Thiago Macieira
card0_error 2018-10-23 (85.91 KB, text/plain) 2018-10-23 17:58 UTC, Thiago Macieira
card0_error 2018-10-29 (80.49 KB, text/plain) 2018-10-29 23:36 UTC, Thiago Macieira
card0_error 2018-11-02 (45.07 KB, text/plain) 2018-11-03 01:52 UTC, Thiago Macieira
card0_error 2018-11-06 (62.27 KB, text/plain) 2018-11-07 00:18 UTC, Thiago Macieira
card0_error 2018-11-16 (79.75 KB, text/plain) 2018-11-17 06:00 UTC, Thiago Macieira
card0_error 2018-11-20 (21.18 KB, text/plain) 2018-11-21 01:35 UTC, Thiago Macieira
card0_error 2018-12-01 (123.52 KB, text/plain) 2018-12-01 22:27 UTC, Thiago Macieira
card0_error 2018-12-01 (123.52 KB, text/plain) 2018-12-01 22:28 UTC, Thiago Macieira
card0_error 2018-12-06 (49.24 KB, text/plain) 2018-12-07 06:38 UTC, Thiago Macieira
card0_error_2018-12-07_lenovo_S300 (165.63 KB, text/plain) 2018-12-07 09:40 UTC, Romek
card0_error 2018-12-08 (79.76 KB, text/plain) 2018-12-08 19:59 UTC, Thiago Macieira
card0_error 2018-12-13 (22.24 KB, text/plain) 2018-12-14 03:01 UTC, Thiago Macieira
card0_error 2019-01-09 (22.17 KB, text/plain) 2019-01-10 00:36 UTC, Thiago Macieira

Description Thiago Macieira 2018-05-02 02:36:25 UTC

Created attachment 139259 [details]
card0_error 2018-05-02

Possibly related to Bug 101991 (which I reported) and Bug 104545 (which says it was fixed by the same commit).

Bug 101991 was about a GPU hang after resuming from hibernation. That is still the problem I am having: after a few cycles of suspend-to-disk (hibernate) and resume, I get a GPU hang soon after resuming, if not immediately after.

Bug 101991 was reportedly fixed by SKL DMC 1.27, which is what I am now using (kernel 4.16.3):

[    4.106911] [drm] Finished loading DMC firmware i915/skl_dmc_ver1_27.bin (v1.27)

Unlike Bug 101991, the screen is still responsive after the hang, not frozen. But many OpenGL workloads stop working, to the point that the desktop is unusable because of EIO errors. It's just good enough for me to reboot cleanly, as opposed to forcing it via Alt+SysRq. Applications are not actually crashing (no coredump is created), but appear to be exiting with an error raised by something inside Mesa.

dmesg log:
[217047.398083] [drm] GPU HANG: ecode 9:0:0x9cba0f27, in kscreenlocker_g [103585], reason: Hang on rcs0, action: reset
[217047.398085] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[217047.398085] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[217047.398086] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[217047.398086] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[217047.398087] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[217047.398104] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[217048.617889] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[217048.617933] i915 0000:00:02.0: Resetting chip after gpu hang
[217049.833883] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[217051.160111] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[217052.482897] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[217052.589836] i915 0000:00:02.0: Failed to reset chip

Attached the card0/error file.
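
The `GPU HANG` line in the log above packs several fields into one string: the dmesg timestamp, the ecode, the process name and PID, the reason and the action. A minimal Python sketch for pulling them apart, which may help when comparing many such reports, follows. The regular expression is inferred from the lines quoted in this bug rather than taken from the i915 source, and `GpuHang` and `parse_gpu_hang` are hypothetical names.

```python
import re
from typing import NamedTuple, Optional


class GpuHang(NamedTuple):
    timestamp: float         # seconds since boot, from the dmesg prefix
    ecode: str               # e.g. "9:0:0x9cba0f27"
    process: Optional[str]   # missing in some reports
    pid: Optional[int]
    reason: str
    action: str


# Assumption: field layout as seen in the dmesg lines quoted in this report.
HANG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+\[drm\] GPU HANG: ecode (?P<ecode>\S+?),"
    r"(?: in (?P<proc>\S+) \[(?P<pid>\d+)\],)?"
    r" reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def parse_gpu_hang(line: str) -> Optional[GpuHang]:
    """Return the parsed fields of a 'GPU HANG' line, or None if it is not one."""
    m = HANG_RE.search(line)
    if m is None:
        return None
    pid = m.group("pid")
    return GpuHang(
        timestamp=float(m.group("ts")),
        ecode=m.group("ecode"),
        process=m.group("proc"),
        pid=int(pid) if pid is not None else None,
        reason=m.group("reason"),
        action=m.group("action"),
    )
```

The process and PID are optional in the pattern because not every hang line names a process.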

Comment 1 Jani Saarinen 2018-05-02 06:38:40 UTC

Please send full dmesg with drm.debug=0x1e from boot to failure? 
Chris, Imre, any thoughts?
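
For reference, `drm.debug` is a bitmask of DRM debug categories. The sketch below decodes a value such as `0x1e`. The bit names follow the categories documented for the `drm.debug` module parameter (CORE, DRIVER, KMS, PRIME, ATOMIC, VBL); treat that table as an assumption to check against the kernel version in use. `decode_drm_debug` is a hypothetical helper.

```python
# Assumed mapping of drm.debug bits to category names; verify against
# the documentation of the kernel actually running.
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
}


def decode_drm_debug(mask: int) -> list:
    """List the category names enabled by a drm.debug bitmask."""
    names = [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]
    # Report any bits this table does not know about as a hex remainder.
    unknown = mask & ~sum(DRM_DEBUG_BITS)
    if unknown:
        names.append(hex(unknown))
    return names


print(decode_drm_debug(0x1E))
```

Under that table, `0x1e` selects DRIVER, KMS, PRIME and ATOMIC messages but not CORE.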

Comment 2 Thiago Macieira 2018-05-02 07:31:12 UTC

Do you really want 2 to 3 days worth of dmesg?

Comment 3 Chris Wilson 2018-05-02 07:39:00 UTC

(In reply to Thiago Macieira from comment #2)
> Do you really want 2 to 3 days worth of dmesg?

No. If we thought it would be relevant, it would be in the error state.

Comment 4 Thiago Macieira 2018-05-02 07:44:21 UTC

(In reply to Chris Wilson from comment #3)
> (In reply to Thiago Macieira from comment #2)
> > Do you really want 2 to 3 days worth of dmesg?
> 
> No. If we thought it would be relevant, it would be in the error state.

Then please reopen, as Jani closed by saying:

> Please send full dmesg with drm.debug=0x1e from boot to failure?

Comment 5 Chris Wilson 2018-05-02 07:56:25 UTC

One bit of information that would be useful here is https://patchwork.freedesktop.org/series/42550/ to differentiate between whether the new ELSP submission was loaded and the seqno write went astray or if it died without seeing the new request.

Comment 6 Jani Saarinen 2018-05-02 09:10:53 UTC

Thiago, what do you mean with reopen? What bug specifically?

Comment 7 Thiago Macieira 2018-05-02 16:01:03 UTC

(In reply to Jani Saarinen from comment #6)
> Thiago, what do you mean with reopen? What bug specifically?

This bug is in NEEDINFO state. That means you're expecting more information from me. If it's not the drm.debug=0x1e for 3 days, then what is it?

(In reply to Chris Wilson from comment #5)
> One bit of information that would be useful here is
> https://patchwork.freedesktop.org/series/42550/ to differentiate between
> whether the new ELSP submission was loaded and the seqno write went astray
> or if it died without seeing the new request.

Applying patches to the kernel means disabling secure boot. I'd rather not, but can do if nothing else solves it.

Comment 8 Jani Saarinen 2018-05-02 19:27:09 UTC

(In reply to Thiago Macieira from comment #7)
> (In reply to Jani Saarinen from comment #6)
> > Thiago, what do you mean with reopen? What bug specifically?
> 
> This bug is in NEEDINFO state. That means you're expecting more information
> from me. If it's not the drm.debug=0x1e for 3 days, then what is it?
Well, I think we were, but it seems Chris does not need that info. This bug was new and has never been in any state other than NEW => NEEDINFO: https://bugs.freedesktop.org/show_activity.cgi?id=106342
> 
> (In reply to Chris Wilson from comment #5)
> > One bit of information that would be useful here is
> > https://patchwork.freedesktop.org/series/42550/ to differentiate between
> > whether the new ELSP submission was loaded and the seqno write went astray
> > or if it died without seeing the new request.
> 
> Applying patches to the kernel means disabling secure boot. I'd rather not,
> but can do if nothing else solves it.

Comment 9 Thiago Macieira 2018-05-02 19:44:54 UTC

I don't understand you, so I'm marking this bug as though you have all the info you need.

If you need more, set back to NEEDINFO and tell me what you need.

Comment 10 Thiago Macieira 2018-05-11 17:22:57 UTC

Created attachment 139501 [details]
card0_error 2018-05-11

dmesg:

[187374.488441] [drm] GPU HANG: ecode 9:0:0x8f5ea223, in chrome [4097], reason: Hang on rcs0, action: reset
[187374.488448] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[187374.488451] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[187374.488454] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[187374.488457] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[187374.488461] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[187374.488508] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[187375.703915] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[187375.703955] i915 0000:00:02.0: Resetting chip after gpu hang
[187376.920268] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[187378.243403] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[187378.590743] asynchronous wait on fence i915:X[2417]/0:15b40 timed out
[187379.566737] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[187379.678708] i915 0000:00:02.0: Failed to reset chip
[187888.334907] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.

$ glxinfo 
name of display: :0
i965: Failed to submit batchbuffer: Input/output error
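
The "Input/output error" text in the `glxinfo` output is the usual C library description of the `EIO` error code, which matches the EIO failures mentioned in the description. A small Python sketch showing that mapping follows; `describe_errno` is a hypothetical helper, and the exact message wording depends on the platform's C library.

```python
import errno
import os


def describe_errno(code: int) -> tuple:
    """Return (symbolic name, human-readable message) for an errno value."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return name, os.strerror(code)


name, message = describe_errno(errno.EIO)
print(name, "->", message)
```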

Comment 11 Chris Wilson 2018-05-11 17:27:45 UTC

(In reply to Thiago Macieira from comment #10)
> Created attachment 139501 [details]
> card0_error 2018-05-11

That one is a regular userspace hang.

With respect to the earlier hangs, we've just applied

commit 77dfedb5be03779f9a5d83e323a1b36e32090105
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri May 11 13:11:45 2018 +0100

    drm/i915/execlists: Use rmb() to order CSB reads
    
    We assume that the CSB is written using the normal ringbuffer
    coherency protocols, as outlined in kernel/events/ring_buffer.c:
    
        *   (HW)                              (DRIVER)
        *
        *   if (LOAD ->data_tail) {            LOAD ->data_head
        *                      (A)             smp_rmb()       (C)
        *      STORE $data                     LOAD $data
        *      smp_wmb()       (B)             smp_mb()        (D)
        *      STORE ->data_head               STORE ->data_tail
        *   }
    
    So we assume that the HW fulfils its ordering requirements (B), and so
    we should use a complimentary rmb (C) to ensure that our read of its
    WRITE pointer is completed before we start accessing the data.
    
    The final mb (D) is implied by the uncached mmio we perform to inform
    the HW of our READ pointer.

to drm-tip which may explain why we didn't drain ELSP.
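
The driver-side ordering described in the commit message can be modelled as: load the hardware's write pointer, then read the entries up to it, then publish the new read pointer. The Python sketch below is a single-threaded toy model of that sequence only. Python has no memory barriers, so the comments merely mark where (B), (C) and (D) would sit in real driver code, and `CsbRing`, `hw_write` and `drain` are hypothetical names, not i915 functions.

```python
class CsbRing:
    """Toy model of a status ring written by hardware and drained by a driver."""

    def __init__(self, size: int):
        self.entries = [None] * size
        self.write_ptr = 0  # advanced by the "hardware" producer
        self.read_ptr = 0   # advanced by the "driver" consumer

    def hw_write(self, event):
        # Producer: store the data first, then (B) a write barrier,
        # then store the updated write pointer.
        self.entries[self.write_ptr % len(self.entries)] = event
        self.write_ptr += 1

    def drain(self) -> list:
        # Consumer: load the write pointer once, up front.
        head = self.write_ptr
        # (C) A real driver would issue rmb() here so that the entry reads
        # below cannot be satisfied before the write-pointer read completes.
        events = []
        while self.read_ptr != head:
            events.append(self.entries[self.read_ptr % len(self.entries)])
            self.read_ptr += 1
        # (D) Publishing the read pointer; per the commit message this is
        # implied by the uncached mmio write in the real driver.
        return events


ring = CsbRing(6)
ring.hw_write("preempted")
ring.hw_write("complete")
print(ring.drain())
```

The model omits overflow handling. It only illustrates why the data reads must not be reordered ahead of the write-pointer read.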

Comment 12 Thiago Macieira 2018-05-11 22:04:05 UTC

(In reply to Chris Wilson from comment #11)
> (In reply to Thiago Macieira from comment #10)
> > Created attachment 139501 [details]
> > card0_error 2018-05-11
> 
> That one is a regular userspace hang.

What does that mean? Is it a Mesa bug? Either way, I don't see how a userspace process should be allowed to do anything that causes other processes to get EIO.

Comment 13 Francesco Balestrieri 2018-05-15 08:31:13 UTC

Thiago, can you or did you try the latest drm-tip that includes the patch Chris is referring to above?

Comment 14 Thiago Macieira 2018-05-15 17:12:12 UTC

(In reply to Francesco Balestrieri from comment #13)
> Thiago, can you or did you try the latest drm-tip that includes the patch
> Chris is referring to above?

I haven't tried that. I'm not a kernel developer, so I don't have a ready-made kernel build. The best I can do is use the latest release from Linus. The commit in question is not even in the latest -rc yet.

Comment 15 Francesco Balestrieri 2018-05-15 17:49:03 UTC

(In reply to Thiago Macieira from comment #14)
> (In reply to Francesco Balestrieri from comment #13)
> > Thiago, can you or did you try the latest drm-tip that includes the patch
> > Chris is referring to above?
> 
> I haven't tried that. I'm not a kernel developer, so I don't have a
> ready-made kernel build. The best I can do is use the latest release from
> Linus. The commit in question is not even in the latest -rc yet.

OK. For what it's worth, the instructions to build drm-tip are here: https://01.org/linuxgraphics/documentation/build-guide-0

Comment 16 Thiago Macieira 2018-05-18 00:45:34 UTC

Created attachment 139619 [details]
card0_error 2018-05-17

dmesg: 

[89982.954152] [drm] GPU HANG: ecode 9:0:0x8adfb5fe, in krunner [2949], reason: Hang on rcs0, action: reset
[89982.954155] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[89982.954156] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[89982.954156] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[89982.954157] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[89982.954157] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[89982.954178] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[89984.169411] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[89984.169486] i915 0000:00:02.0: Resetting chip after gpu hang
[89985.386115] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[89986.148508] asynchronous wait on fence i915:X[2695]/0:171977 timed out
[89986.708558] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[89988.032543] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[89988.140506] i915 0000:00:02.0: Failed to reset chip

The only difference this time is that it did not happen immediately after resuming from hibernation, but after about a minute. I managed to log in and see my desktop. It wasn't until I tried to use krunner that the hang was reported.

I'm now going to go one week without using the USB-C dock. Let's see if the hang happens without it.

Comment 17 Thiago Macieira 2018-05-30 02:10:50 UTC

Created attachment 139841 [details]
card0_error 2018-05-29

Nope, the USB-C dock is not an influence. This GPU hang happened without any USB-C connection.

[203461.307996] [drm] GPU HANG: ecode 9:0:0x22bfff23, in krunner [51086], reason: Hang on rcs0, action: reset
[203461.308001] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[203461.308004] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[203461.308006] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[203461.308009] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[203461.308012] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[203461.308074] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[203462.520867] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[203462.520996] i915 0000:00:02.0: Resetting chip after gpu hang
[203463.739120] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[203464.424718] asynchronous wait on fence i915:X[2357]/0:3f9891 timed out
[203465.063100] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[203466.384830] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[203466.492741] i915 0000:00:02.0: Failed to reset chip

Comment 18 Thiago Macieira 2018-06-02 00:26:14 UTC

Different message today. No card0/error was generated:

[113192.331640] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[113192.331704] i915 0000:00:02.0: Resetting chip after gpu hang
[113193.547824] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[113194.871950] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[113196.196713] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[113196.302119] i915 0000:00:02.0: Failed to reset chip
[113196.334112] asynchronous wait on fence i915:X[2820]/0:1db556 timed out

Comment 19 Thiago Macieira 2018-06-09 01:46:17 UTC

Created attachment 140103 [details]
card0_error 2018-06-08

Are more of these files useful? Is there any new information to be gleaned from them, or are they all saying the same thing?

Comment 20 Thiago Macieira 2018-06-22 23:26:19 UTC

Created attachment 140288 [details]
card0_error 2018-06-22

4.16.12

[136676.117554] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[136676.117601] i915 0000:00:02.0: Resetting chip after gpu hang
[136677.335666] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[136678.660287] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[136679.982186] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[136680.089493] i915 0000:00:02.0: Failed to reset chip

Comment 21 Thiago Macieira 2018-07-12 02:38:38 UTC

Created attachment 140582 [details]
card0_error 2018-07-11

kernel 4.17.3

[166313.855821] drm: not enough stolen space for compressed buffer (need 50688000 more bytes), disabling. Hint: you may be able to increase stolen memory size in the BIOS to avoid this.
[166320.700776] [drm] GPU HANG: ecode 9:-1:0x00000000, reason: Kicking stuck wait on rcs0, action: reset
[166320.700778] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[166320.700778] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[166320.700779] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[166320.700780] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[166320.700780] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[166320.700796] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[166321.911857] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[166321.911987] i915 0000:00:02.0: Resetting chip after gpu hang
[166323.115741] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[166324.426461] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[166325.735849] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[166325.843762] i915 0000:00:02.0: Failed to reset chip
[166327.091830] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[166327.715066] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.

Comment 22 Simon Lee 2018-07-17 16:18:08 UTC

Hi Chris, Francesco,

Do you need any more information to progress?

Comment 23 Thiago Macieira 2018-07-20 23:56:53 UTC

Created attachment 140747 [details]
card0_error 2018-07-20

4.17.4

[97626.210963] [drm] GPU HANG: ecode 9:0:0x63ec03e1, in plasmashell [2620], reason: Hang on rcs0, action: reset
[97626.210966] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[97626.210967] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[97626.210968] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[97626.210969] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[97626.210970] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[97626.210993] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[97627.414130] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[97627.414279] i915 0000:00:02.0: Resetting chip after gpu hang
[97628.618081] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[97629.926095] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[97631.234083] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[97631.339429] i915 0000:00:02.0: Failed to reset chip
[97632.587457] i915 0000:00:02.0: i915_reset_device timed out, cancelling all in-flight rendering.
[97632.602102] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[97636.507400] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.

Comment 24 Thiago Macieira 2018-08-03 00:13:30 UTC

Created attachment 140944 [details]
card0_error 2018-07-30

4.17.6

[200509.348277] [drm] GPU HANG: ecode 9:0:0xa3edbc82, in chrome [4770], reason: Hang on rcs0, action: reset
[200509.348279] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[200509.348280] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[200509.348280] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[200509.348281] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[200509.348282] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[200509.348297] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[200510.549451] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[200510.549500] i915 0000:00:02.0: Resetting chip after gpu hang
[200511.750749] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[200513.061331] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[200514.369250] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[200514.474645] i915 0000:00:02.0: Failed to reset chip
[200515.646641] i915 0000:00:02.0: i915_reset_device timed out, cancelling all in-flight rendering.
[200515.710684] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout

Comment 25 Thiago Macieira 2018-08-03 00:14:20 UTC

Created attachment 140945 [details]
card0_error 2018-08-02

4.17.6

[164735.714076] [drm] GPU HANG: ecode 9:0:0x8463451a, in kscreenlocker_g [65753], reason: Hang on rcs0, action: reset
[164735.714078] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[164735.714079] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[164735.714080] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[164735.714080] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[164735.714081] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[164735.714095] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[164736.917478] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[164736.917566] i915 0000:00:02.0: Resetting chip after gpu hang
[164738.119008] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[164739.429550] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[164740.738040] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[164740.842805] i915 0000:00:02.0: Failed to reset chip
[164742.010813] i915 0000:00:02.0: i915_reset_device timed out, cancelling all in-flight rendering.
[164742.098849] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[164744.910725] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.

Comment 26 Thiago Macieira 2018-08-09 16:26:11 UTC

Created attachment 141027 [details]
card0_error 2018-08-09

4.7.11

[167358.604501] [drm] GPU HANG: ecode 9:0:0x6140dc79, in plasmashell [2636], reason: Hang on rcs0, action: reset
[167358.604503] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[167358.604504] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[167358.604504] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[167358.604505] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[167358.604506] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[167358.604516] i915 0000:00:02.0: Resetting rcs0 after gpu hang
[167359.806616] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[167359.806759] i915 0000:00:02.0: Resetting chip after gpu hang
[167361.009010] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[167361.826462] asynchronous wait on fence i915:X[2471]/0:1a05d timed out
[167362.316299] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[167363.625181] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[167363.730468] i915 0000:00:02.0: Failed to reset chip
[167364.898452] i915 0000:00:02.0: i915_reset_device timed out, cancelling all in-flight rendering.
[167364.965180] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout

Comment 27 Thiago Macieira 2018-08-09 16:26:41 UTC

That's 4.17.11

Comment 28 Thiago Macieira 2018-08-28 19:36:05 UTC

Now running 4.18.0, which does contain commit 77dfedb5be03779f9a5d83e323a1b36e32090105. Will report if I still experience issues.

Comment 29 Thiago Macieira 2018-08-30 15:57:41 UTC

Created attachment 141384 [details]
card0_error 2018-08-30

kernel 4.18.0, DMC 1.27. No change in behaviour; I am having the exact same problem.

dmesg:

[182585.295906] [drm] GPU HANG: ecode 9:0:0xdd607401, in plasmashell [2643], reason: hang on rcs0, action: reset
[182585.295910] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[182585.295910] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[182585.295911] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[182585.295912] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[182585.295912] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[182585.295971] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[182585.297223] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[182585.297253] i915 0000:00:02.0: Resetting chip for hang on rcs0
[182585.298805] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[182585.407018] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[182585.514988] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[182585.621681] i915 0000:00:02.0: Failed to reset chip
[182585.623039] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
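
One way to compare logs like this across kernel versions is to measure how long the driver spends between the first "Resetting" line and "Failed to reset chip". In the excerpt above that span is roughly a third of a second, while in the log from the original description it is a little over five seconds. The sketch below computes it from pasted dmesg text; `parse_dmesg` and `reset_duration` are hypothetical helpers, and the timestamp pattern is inferred from the lines quoted in this report.

```python
import re

# Assumption: every relevant line starts with a "[seconds.micros]" prefix.
TS_RE = re.compile(r"^\[\s*(\d+\.\d+)\]\s+(.*)$")


def parse_dmesg(text: str) -> list:
    """Return (timestamp, message) pairs for lines with a dmesg time prefix."""
    pairs = []
    for line in text.splitlines():
        m = TS_RE.match(line.strip())
        if m:
            pairs.append((float(m.group(1)), m.group(2)))
    return pairs


def reset_duration(text: str):
    """Seconds from the first 'Resetting' line to 'Failed to reset chip'.

    Returns None if either marker is missing from the text.
    """
    start = end = None
    for ts, msg in parse_dmesg(text):
        if start is None and "Resetting" in msg:
            start = ts
        if "Failed to reset chip" in msg:
            end = ts
    if start is None or end is None:
        return None
    return end - start
```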

Comment 30 Thiago Macieira 2018-08-31 00:28:01 UTC

Created attachment 141389 [details]
card0_error 2018-08-30 (second of the same day)

dmesg:

[27643.613948] [drm] GPU HANG: ecode 9:0:0x283b3249, in X [2799], reason: hang on rcs0, action: reset
[27643.613949] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[27643.613950] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[27643.613950] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[27643.613951] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[27643.613951] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[27643.613966] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[27643.615198] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[27643.615230] i915 0000:00:02.0: Resetting chip for hang on rcs0
[27643.616507] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[27643.723158] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[27643.831267] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[27643.941964] i915 0000:00:02.0: Failed to reset chip
[27643.943312] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[27645.285925] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.

Comment 31 Thiago Macieira 2018-09-29 00:29:20 UTC

Created attachment 141783 [details]
card0_error 2018-09-18

[194075.388443] [drm] GPU HANG: ecode 9:0:0x575ec1a7, in kscreenlocker_g [49285], reason: hang on rcs0, action: reset
[194075.388446] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[194075.388448] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[194075.388449] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[194075.388450] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[194075.388452] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[194075.388481] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[194075.389750] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[194075.389827] i915 0000:00:02.0: Resetting chip for hang on rcs0
[194075.392877] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[194075.503680] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[194075.611830] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[194075.718389] i915 0000:00:02.0: Failed to reset chip
[194075.719758] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout

Comment 32 Thiago Macieira 2018-09-29 00:29:57 UTC

Created attachment 141784 [details]
card0_error 2018-09-21

kernel 4.18.8

[110720.786094] [drm] GPU HANG: ecode 9:0:0x61a6fe91, in kwin_x11 [2608], reason: hang on rcs0, action: reset
[110720.786099] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[110720.786101] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[110720.786103] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[110720.786105] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[110720.786107] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[110720.786193] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[110720.787519] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[110720.787622] i915 0000:00:02.0: Resetting chip for hang on rcs0
[110720.789327] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[110720.896415] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[110721.008435] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[110721.119116] i915 0000:00:02.0: Failed to reset chip
[110721.120485] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout

Comment 33 Thiago Macieira 2018-09-29 00:30:30 UTC

Created attachment 141785 [details]
card0_error 2018-09-28

kernel 4.18.8

[197950.408173] [drm] GPU HANG: ecode 9:0:0x8fdfbffe, in kmail [42709], reason: hang on rcs0, action: reset
[197950.408179] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[197950.408182] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[197950.408185] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[197950.408187] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[197950.408190] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[197950.408268] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[197950.409563] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[197950.409656] i915 0000:00:02.0: Resetting chip for hang on rcs0
[197950.411037] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[197950.520735] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[197950.628678] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[197950.735409] i915 0000:00:02.0: Failed to reset chip
[197950.736758] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[197951.811342] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.

Comment 34 Thiago Macieira 2018-10-03 23:46:10 UTC

Created attachment 141866 [details]
card0_error 2018-10-03

kernel 4.18.8

[227443.174994] [drm] GPU HANG: ecode 9:0:0x8edb2106, in kmail [23122], reason: hang on rcs0, action: reset
[227443.174997] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[227443.174998] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[227443.174999] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[227443.175001] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[227443.175002] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[227443.175034] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[227443.176287] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[227443.176345] i915 0000:00:02.0: Resetting chip for hang on rcs0
[227443.177689] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[227443.283872] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[227443.395812] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[227443.502465] i915 0000:00:02.0: Failed to reset chip
[227443.503735] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout

Comment 35 Thiago Macieira 2018-10-10 01:41:26 UTC

Created attachment 141968 [details]
card0_error 2018-10-09

kernel 4.18.9

[89670.191894] [drm] GPU HANG: ecode 9:0:0x00815216, in chrome [5645], reason: hang on rcs0, action: reset
[89670.191898] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[89670.191900] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[89670.191902] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[89670.191903] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[89670.191905] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[89670.191944] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[89670.193228] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[89670.193313] i915 0000:00:02.0: Resetting chip for hang on rcs0
[89670.194772] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[89670.300444] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[89670.408385] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[89670.519134] i915 0000:00:02.0: Failed to reset chip
[89670.520482] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[89671.894458] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.

Comment 36 Thiago Macieira 2018-10-10 01:41:59 UTC

Is there light at the end of the tunnel? Is this fixed in any upcoming version?

Comment 37 Lakshmi 2018-10-10 06:55:40 UTC

I assume this is a Mesa bug, so changing the product to Mesa.

Comment 38 Thiago Macieira 2018-10-10 18:23:59 UTC

(In reply to Lakshmi from comment #37)
> I assume this is a Mesa bug, so changing the product to Mesa.

Considering it's a GPU hang, why do you assume it's a Mesa bug?

Comment 39 Lionel Landwerlin 2018-10-10 20:47:33 UTC

(In reply to Thiago Macieira from comment #38)
> (In reply to Lakshmi from comment #37)
> > I assume this is a Mesa bug, so changing the product to Mesa.
> 
> Considering it's a GPU hang, why do you assume it's a Mesa bug?

The last 4 error states you attached indicate that most units of the 3D engine are not busy.

ACTHD does not seem to point to a location in the batch.

IPEHR is fairly weird too (last executed instruction):
 0x9e79016f (WTH is this?)
 0x710cdef8 (Still unknown...)
 0x70004000 (MEDIA_VFE_STATE, used for compute, but not present in the batch)

Also, INSTDONE is bonkers:

  INSTDONE: 0xffdffffe
    PRB0 Ring Enable: false
    CS Done: false

  INSTDONE: 0xffd7fffe
    PRB0 Ring Enable: false
    GAM Done: false
    CS Done: false

Usually Ring Enable is true.

Does that look more like something that happens with display hangs?
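
For readers comparing the two INSTDONE values: the entries decoded as "false" correspond to the bits that are zero in the register. The Python sketch below (the helper name clear_bits is made up for illustration) lists those zero bits for both values. Which bit carries which name is not stated here; the only thing the two dumps show is that the second value has one more zero bit (bit 19) and one more "false" entry (GAM Done), so treating bit 19 as GAM Done is an inference, not a documented mapping.

```python
def clear_bits(value, width=32):
    """Return the positions of the bits that are 0 in a register value."""
    return [bit for bit in range(width) if not (value >> bit) & 1]

first = clear_bits(0xffdffffe)    # first INSTDONE value quoted above
second = clear_bits(0xffd7fffe)   # second INSTDONE value quoted above

print(first)                              # zero bits in the first dump
print(second)                             # zero bits in the second dump
print(sorted(set(second) - set(first)))   # bit that differs between them
```

Two zero bits against two "false" entries in the first dump, and three against three in the second, is at least consistent with a one-bit-per-unit reading.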

Comment 40 Thiago Macieira 2018-10-10 21:03:22 UTC

There are ioctls returning EIO in a newly-launched process, like glxinfo. I'll get the exact ioctl that is failing next time this happens.

To me, that says the problem is inside the kernel. No matter what previous processes did, the kernel ought to honour the new ones.

Comment 41 Thiago Macieira 2018-10-23 17:58:35 UTC

Created attachment 142158 [details]
card0_error 2018-10-23

4.18.12

dmesg:
[409897.549764] [drm] GPU HANG: ecode 9:0:0x57abd315, in chrome [68092], reason: hang on rcs0, action: reset
[409897.549767] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[409897.549767] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[409897.549768] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[409897.549768] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[409897.549769] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[409897.549787] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[409897.551031] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[409897.551066] i915 0000:00:02.0: Resetting chip for hang on rcs0
[409897.552430] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[409897.661286] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[409897.769317] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[409897.875998] i915 0000:00:02.0: Failed to reset chip
[409897.877312] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout

strace from glxinfo:

openat(AT_FDCWD, "/etc/drirc", O_RDONLY) = 5</etc/drirc>
read(5</etc/drirc>, "<!--\n\n=========================="..., 4096) = 4096
getrandom("\xb2\x57\x3a\xe1\xb0\xb4\x25\x28", 8, GRND_NONBLOCK) = 8
read(5</etc/drirc>, "tion name=\"allow_glsl_builtin_va"..., 4096) = 4096
read(5</etc/drirc>, "n\" executable=\"AlienIsolation\">\n"..., 4096) = 4096
read(5</etc/drirc>, "lso higher gpu load. -->\n       "..., 4096) = 1354
read(5</etc/drirc>, "", 4096)           = 0
close(5</etc/drirc>)                    = 0
openat(AT_FDCWD, "/home/tjmaciei/.drirc", O_RDONLY) = -1 ENOENT (No such file or directory)
getrandom("\x06\x80\xb3\x56\x96\xe9\x0c\x07", 8, GRND_NONBLOCK) = 8
openat(AT_FDCWD, "/etc/drirc", O_RDONLY) = 5</etc/drirc>
read(5</etc/drirc>, "<!--\n\n=========================="..., 4096) = 4096
getrandom("\xca\x88\xbd\x26\xbb\x9e\x85\xfd", 8, GRND_NONBLOCK) = 8
read(5</etc/drirc>, "tion name=\"allow_glsl_builtin_va"..., 4096) = 4096
read(5</etc/drirc>, "n\" executable=\"AlienIsolation\">\n"..., 4096) = 4096
read(5</etc/drirc>, "lso higher gpu load. -->\n       "..., 4096) = 1354
read(5</etc/drirc>, "", 4096)           = 0
close(5</etc/drirc>)                    = 0
openat(AT_FDCWD, "/home/tjmaciei/.drirc", O_RDONLY) = -1 ENOENT (No such file or directory)
geteuid()                               = 1000
getuid()                                = 1000
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GETPARAM, 0x7ffc087cfc00) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GETPARAM, 0x7ffc087cfbb0) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x7ffc087cfbb0) = -1 ENOENT (No such file or directory)
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GETPARAM, 0x7ffc087cfc00) = 0
futex(0x7f667f85f4e8, FUTEX_WAKE_PRIVATE, 2147483647) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GETPARAM, 0x7ffc087cfc00) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_GET_APERTURE, 0x7ffc087cfca0) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_CREATE, 0x7ffc087cfbd0) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_SET_TILING, 0x7ffc087cfb20) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_SET_DOMAIN, 0x7ffc087cfbc4) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_GEM_CLOSE, 0x7ffc087cfb90) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_REG_READ, 0x7ffc087cfc00) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GETPARAM, 0x7ffc087cfc00) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GETPARAM, 0x7ffc087cfc00) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GETPARAM, 0x7ffc087cfc00) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GETPARAM, 0x7ffc087cfc00) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GETPARAM, 0x7ffc087cfc00) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GETPARAM, 0x7ffc087cfc00) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GET_RESET_STATS, 0x7ffc087cfca0) = 0
brk(0x558cd1dd0000)                     = 0x558cd1dd0000
brk(0x558cd1df1000)                     = 0x558cd1df1000
brk(0x558cd1e12000)                     = 0x558cd1e12000
brk(0x558cd1e33000)                     = 0x558cd1e33000
brk(0x558cd1e54000)                     = 0x558cd1e54000
brk(0x558cd1e75000)                     = 0x558cd1e75000
brk(0x558cd1e96000)                     = 0x558cd1e96000
brk(0x558cd1eb7000)                     = 0x558cd1eb7000
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GETPARAM, 0x7ffc087cfc00) = 0
geteuid()                               = 1000
getuid()                                = 1000
getuid()                                = 1000
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, 0) = 5<socket:[12127798]>
connect(5<socket:[12127798]>, {sa_family=AF_UNIX, sun_path="/var/run/nscd/socket"}, 110) = 0
sendto(5<socket:[12127798]>, "\2\0\0\0\v\0\0\0\7\0\0\0passwd\0", 19, MSG_NOSIGNAL, NULL, 0) = 19
poll([{fd=5<socket:[12127798]>, events=POLLIN|POLLERR|POLLHUP}], 1, 5000) = 1 ([{fd=5, revents=POLLIN|POLLHUP}])
recvmsg(5<socket:[12127798]>, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="passwd\0", iov_len=7}, {iov_base="\310O\3\0\0\0\0\0", iov_len=8}], msg_iovlen=2, msg_control=[{cmsg_len=20, cmsg_level=SOL_SOCKET, cmsg_type=SCM_RIGHTS, cmsg_data=[6</var/lib/nscd/passwd>]}], msg_controllen=20, msg_flags=MSG_CMSG_CLOEXEC}, MSG_CMSG_CLOEXEC) = 15
mmap(NULL, 217032, PROT_READ, MAP_SHARED, 6</var/lib/nscd/passwd>, 0) = 0x7f6681489000
close(6</var/lib/nscd/passwd>)          = 0
close(5<socket:[12127798]>)             = 0
stat("/home/tjmaciei", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/home/tjmaciei/.cache", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/home/tjmaciei/.cache", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
stat("/home/tjmaciei/.cache/mesa_shader_cache", {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
openat(AT_FDCWD, "/home/tjmaciei/.cache/mesa_shader_cache/index", O_RDWR|O_CREAT|O_CLOEXEC, 0644) = 5</home/tjmaciei/dev/cache/mesa_shader_cache/index>
fstat(5</home/tjmaciei/dev/cache/mesa_shader_cache/index>, {st_mode=S_IFREG|0644, st_size=1310728, ...}) = 0
mmap(NULL, 1310728, PROT_READ|PROT_WRITE, MAP_SHARED, 5</home/tjmaciei/dev/cache/mesa_shader_cache/index>, 0) = 0x7f667eb7f000
close(5</home/tjmaciei/dev/cache/mesa_shader_cache/index>) = 0
rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
mmap(NULL, 8392704, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_STACK, -1, 0) = 0x7f667e37e000
mprotect(0x7f667e37f000, 8388608, PROT_READ|PROT_WRITE) = 0
clone(child_stack=0x7f667eb7dfb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f667eb7e9d0, tls=0x7f667eb7e700, child_tidptr=0x7f667eb7e9d0) = 3665
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
sched_setscheduler(3665, SCHED_IDLE, [0]) = 0
futex(0x7f667f7b6d80, FUTEX_WAKE_PRIVATE, 2147483647) = 0
openat(AT_FDCWD, "/dev/urandom", O_RDONLY) = 5</dev/urandom>
read(5</dev/urandom>, "\334x\331\366\315wu\265\0364\227\321\363\346r\222", 16) = 16
close(5</dev/urandom>)                  = 0
brk(0x558cd1ed8000)                     = 0x558cd1ed8000
getpid()                                = 3664
getpid()                                = 3664
getpid()                                = 3664
getpid()                                = 3664
getpid()                                = 3664
mmap(NULL, 266240, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f667e33d000
poll([{fd=3<socket:[12129015]>, events=POLLIN|POLLOUT}], 1, -1) = 1 ([{fd=3, revents=POLLOUT}])
writev(3<socket:[12129015]>, [{iov_base="\227#Z\3\1\0\0\0\4\0\0\0\1\0\0\0\t\r\0\0006\0\0\0\1\0\0\0\4\0\0\0"..., iov_len=3488}], 1) = 3488
poll([{fd=3<socket:[12129015]>, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvmsg(3<socket:[12129015]>, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\0\247\33\0j\1\0\0\"\0\227\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 64
munmap(0x7f667e33d000, 266240)          = 0
getpid()                                = 3664
getpid()                                = 3664
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_CREATE, 0x7ffc087cfb80) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_SET_DOMAIN, 0x7ffc087cfb74) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_CREATE, 0x7ffc087cfb80) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_SET_DOMAIN, 0x7ffc087cfb74) = 0
brk(0x558cd1ef9000)                     = 0x558cd1ef9000
openat(AT_FDCWD, "/etc/drirc", O_RDONLY) = 5</etc/drirc>
read(5</etc/drirc>, "<!--\n\n=========================="..., 4096) = 4096
getrandom("\x65\xdd\x4e\xa9\x48\x6e\xaf\x7a", 8, GRND_NONBLOCK) = 8
read(5</etc/drirc>, "tion name=\"allow_glsl_builtin_va"..., 4096) = 4096
read(5</etc/drirc>, "n\" executable=\"AlienIsolation\">\n"..., 4096) = 4096
read(5</etc/drirc>, "lso higher gpu load. -->\n       "..., 4096) = 1354
read(5</etc/drirc>, "", 4096)           = 0
close(5</etc/drirc>)                    = 0
openat(AT_FDCWD, "/home/tjmaciei/.drirc", O_RDONLY) = -1 ENOENT (No such file or directory)
brk(0x558cd1f1b000)                     = 0x558cd1f1b000
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_CREATE, 0x7ffc087cfba0) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_SET_DOMAIN, 0x7ffc087cfb94) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_MMAP, 0x7ffc087cfba0) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_CREATE, 0x7ffc087cfba0) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_SET_DOMAIN, 0x7ffc087cfb94) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_MMAP, 0x7ffc087cfba0) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_CONTEXT_CREATE, 0x7ffc087cfc20) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_CREATE, 0x7ffc087cfbc0) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_SET_DOMAIN, 0x7ffc087cfbb4) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_CREATE, 0x7ffc087cfbb0) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_SET_DOMAIN, 0x7ffc087cfba4) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_MMAP, 0x7ffc087cfbb0) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_CREATE, 0x7ffc087cfb60) = 0
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_SET_DOMAIN, 0x7ffc087cfb54) = 0
poll([{fd=3<socket:[12129015]>, events=POLLIN|POLLOUT}], 1, -1) = 1 ([{fd=3, revents=POLLOUT}])
writev(3<socket:[12129015]>, [{iov_base="\227\"\r\0\3\0`\7\233\0\0\0\0\0\0\0\0\0\0\0\1\0\0\0\3\0\0\0\221 \0\0"..., iov_len=56}], 1) = 56
poll([{fd=3<socket:[12129015]>, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvmsg(3<socket:[12129015]>, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1\1\36\0\0\0\0\0\7\0\240\2\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 32
getpid()                                = 3664
getpid()                                = 3664
getpid()                                = 3664
recvmsg(3<socket:[12129015]>, {msg_namelen=0}, 0) = -1 EAGAIN (Resource temporarily unavailable)
recvmsg(3<socket:[12129015]>, {msg_namelen=0}, 0) = -1 EAGAIN (Resource temporarily unavailable)
getpid()                                = 3664
mmap(NULL, 4096, PROT_READ|PROT_WRITE|PROT_EXEC, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7f668147b000
poll([{fd=3<socket:[12129015]>, events=POLLIN|POLLOUT}], 1, -1) = 1 ([{fd=3, revents=POLLOUT}])
writev(3<socket:[12129015]>, [{iov_base="N\0\4\0\1\0`\7j\1\0\0\22\1\0\0\1\30\f\0\4\0`\7j\1\0\0\0\0\0\0"..., iov_len=72}, {iov_base=NULL, iov_len=0}, {iov_base="", iov_len=0}], 3) = 72
poll([{fd=3<socket:[12129015]>, events=POLLIN|POLLOUT}], 1, -1) = 1 ([{fd=3, revents=POLLOUT}])
writev(3<socket:[12129015]>, [{iov_base="b\0\3\0\4\0\0\0DRI2", iov_len=12}], 1) = 12
poll([{fd=3<socket:[12129015]>, events=POLLIN}], 1, -1) = 1 ([{fd=3, revents=POLLIN}])
recvmsg(3<socket:[12129015]>, {msg_name=NULL, msg_namelen=0, msg_iov=[{iov_base="\1\0\"\0\0\0\0\0\1\232w\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0", iov_len=4096}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 32
ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x7ffc087d0220) = -1 EIO (Input/output error)
write(2</dev/pts/2>, "i965: Failed to submit batchbuff"..., 55i965: Failed to submit batchbuffer: Input/output error
) = 55
futex(0x558cd1eaea50, FUTEX_WAKE_PRIVATE, 2147483647) = 1
futex(0x558cd1eaea00, FUTEX_WAKE_PRIVATE, 1) = 1
getpid()                                = 3664
exit_group(1)                           = ?

As you can see near the end, the DRM_IOCTL_I915_GEM_EXECBUFFER2 ioctl fails with EIO. This indicates the problem is in the kernel.
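
To pick the failing calls out of a long strace like the one above, a small filter can help. This is a minimal Python sketch, assuming the trace was captured in the same format as shown here (file descriptors annotated with their paths); the function name failed_ioctls and the regular expression are illustrative and not part of any tool.

```python
import re

# Matches strace lines such as:
#   ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x...) = -1 EIO (Input/output error)
FAILED_IOCTL = re.compile(
    r"^ioctl\((?P<fd>\d+)(?:<(?P<path>[^>]*)>)?,\s*(?P<request>\w+),.*\)"
    r"\s*=\s*-1\s+(?P<errno>E\w+)"
)

def failed_ioctls(lines):
    """Yield (request, errno name) for every ioctl line that returned -1."""
    for line in lines:
        match = FAILED_IOCTL.match(line.strip())
        if match:
            yield match.group("request"), match.group("errno")

# Three lines copied from the trace above.
sample = [
    "ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GETPARAM, 0x7ffc087cfc00) = 0",
    "ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x7ffc087cfbb0) = -1 ENOENT (No such file or directory)",
    "ioctl(4</dev/dri/card0>, DRM_IOCTL_I915_GEM_EXECBUFFER2, 0x7ffc087d0220) = -1 EIO (Input/output error)",
]

for request, errno_name in failed_ioctls(sample):
    print(request, errno_name)
```

Run over the full trace, this should report two failures: the early ENOENT, after which glxinfo carries on, and the EIO, after which it prints the i965 error and exits.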

Comment 42 Thiago Macieira 2018-10-29 23:36:17 UTC

Created attachment 142266 [details]
card0_error 2018-10-29

4.18.12:

[304462.511265] [drm] GPU HANG: ecode 9:0:0x6f656195, in krunner [3106], reason: hang on rcs0, action: reset
[304462.511267] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[304462.511267] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[304462.511268] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[304462.511268] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[304462.511268] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[304462.511291] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[304462.512522] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[304462.512554] i915 0000:00:02.0: Resetting chip for hang on rcs0
[304462.513873] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[304462.621208] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[304462.729200] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[304462.835854] i915 0000:00:02.0: Failed to reset chip
[304462.837177] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
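
Since the same dmesg pattern recurs in every comment, it may be easier to compare occurrences by extracting the fields of the "GPU HANG" line and checking whether the chip reset failed. The following Python sketch assumes the log lines look exactly like the excerpts in this report; parse_hang and summarize are hypothetical helper names, and the three parts of the ecode are deliberately left uninterpreted.

```python
import re

# Pattern for the "GPU HANG" line as it appears in the dmesg excerpts in this report.
HANG_LINE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[^,]+), in (?P<process>.+?) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)

def parse_hang(line):
    """Return the fields of a GPU HANG line as a dict, or None for other lines."""
    match = HANG_LINE.search(line)
    return match.groupdict() if match else None

def summarize(lines):
    """Collect the hang records and note whether the full-chip reset failed."""
    hangs = [h for h in map(parse_hang, lines) if h]
    reset_failed = any("Failed to reset chip" in line for line in lines)
    return hangs, reset_failed

# Lines copied from the excerpt above.
sample = [
    "[304462.511265] [drm] GPU HANG: ecode 9:0:0x6f656195, in krunner [3106], reason: hang on rcs0, action: reset",
    "[304462.511291] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0",
    "[304462.835854] i915 0000:00:02.0: Failed to reset chip",
]

hangs, reset_failed = summarize(sample)
print(hangs[0]["process"], hangs[0]["ecode"], reset_failed)
```

Feeding it the logs from several comments would show directly that the hang is attributed to a different process each time (kwin_x11, kmail, chrome, krunner, ...) while the reset failure is constant.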

Comment 43 Lionel Landwerlin 2018-10-30 11:57:01 UTC

What version of Mesa are you running?

Comment 44 Thiago Macieira 2018-10-30 16:02:38 UTC

(In reply to Lionel Landwerlin from comment #43)
> What version of Mesa are you running?

18.1.7 currently.

Comment 45 Thiago Macieira 2018-11-03 01:52:02 UTC

Created attachment 142352 [details]
card0_error 2018-11-02

kernel 4.18.15

[144096.419245] [drm] GPU HANG: ecode 9:0:0x8adec402, in kmail [124429], reason: hang on rcs0, action: reset
[144096.419249] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[144096.419251] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[144096.419252] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[144096.419253] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[144096.419255] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[144096.419292] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[144096.420545] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[144096.420610] i915 0000:00:02.0: Resetting chip for hang on rcs0
[144096.421946] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[144096.530458] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[144096.638465] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[144096.745251] i915 0000:00:02.0: Failed to reset chip
[144096.746500] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout

Comment 46 Thiago Macieira 2018-11-07 00:18:00 UTC

Created attachment 142394 [details]
card0_error 2018-11-06

4.18.15

[190495.326290] [drm] GPU HANG: ecode 9:0:0x60a3ff22, in qtcreator [5792], reason: hang on rcs0, action: reset
[190495.326296] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[190495.326306] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[190495.326309] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[190495.326312] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[190495.326316] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[190495.326362] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[190495.327658] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[190495.327748] i915 0000:00:02.0: Resetting chip for hang on rcs0
[190495.329158] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[190495.436783] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[190495.544748] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[190495.655468] i915 0000:00:02.0: Failed to reset chip
[190495.656807] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[190497.643368] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.

Comment 47 Thiago Macieira 2018-11-17 06:00:26 UTC

Created attachment 142494 [details]
card0_error 2018-11-16

4.18.15

[251467.019461] [drm] GPU HANG: ecode 9:0:0xaedce18e, in chrome [3345], reason: hang on rcs0, action: reset
[251467.019512] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[251467.020747] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[251467.020802] i915 0000:00:02.0: Resetting chip for hang on rcs0
[251467.022269] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[251467.130920] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[251467.238851] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[251467.345543] i915 0000:00:02.0: Failed to reset chip
[251467.346885] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout

Comment 48 Thiago Macieira 2018-11-17 06:02:04 UTC

Are we convinced already that this is NOT a Mesa bug, but an i915/firmware one?

 $ ltrace glxinfo
XOpenDisplay(0, 0x7fffcd6ced08, 0x7fffcd6ceae0, 0x7f9883fa2718)                          = 0x55e492342d00
__printf_chk(1, 0x55e491f86408, 0x55e492343f50, 0name of display: :0
)                                       = 20
glXChooseVisual(0x55e492342d00, 0, 0x55e491f8e200, 0)                                    = 0x55e492371e50
XFree(0x55e492371e50, 0x55e4924b7130, 1, 0)                                              = 1
glXChooseFBConfig(0x55e492342d00, 0, 0x7fffcd6ce9d0, 0x7fffcd6ce960)                     = 0x55e4923720e0
glXQueryExtensionsString(0x55e492342d00, 0, 1, 6)                                        = 0x55e4923726f0
strstr("GLX_ARB_create_context GLX_ARB_c"..., "GLX_ARB_create_context_profile")          = "GLX_ARB_create_context_profile G"...
strlen("GLX_ARB_create_context_profile")                                                 = 30
glXGetProcAddress(0x55e491f86124, 0x55e491f86998, 0x55e491f86998, 24)                    = 0x7f9884201fb0
XSetErrorHandler(0x55e491f82cd0, 0, 5, 0)                                                = 0x7f9883ff00d0
XSetErrorHandler(0x7f9883ff00d0, 1, 0x55e49238cc70, 0)                                   = 0x55e491f82cd0
XSetErrorHandler(0x55e491f82cd0, 0x55e4924aae30, 5, 5)                                   = 0x7f9883ff00d0
XSetErrorHandler(0x7f9883ff00d0, 0, 0x7f9884210320, 1)                                   = 0x55e491f82cd0
glXIsDirect(0x55e492342d00, 0x55e492372f50, 0x7f9884210320, 1)                           = 1
glXGetVisualFromFBConfig(0x55e492342d00, 0x55e4924aae30, 1, 0)                           = 0x55e492371e50
XFree(0x55e4923720e0, 0x55e492344ef0, 0, 0x55e4923451a0)                                 = 1
XCreateColormap(0x55e492342d00, 362, 0x55e49234e7e0, 0)                                  = 0x1c00001
XCreateWindow(0x55e492342d00, 362, 0, 0)                                                 = 0x1c00004
glXMakeCurrent(0x55e492342d00, 0x1c00004, 0x55e492372f50, 0xeff5i965: Failed to submit batchbuffer: Input/output error
 <no return ...>
+++ exited (status 1) +++

Comment 49 Thiago Macieira 2018-11-21 01:35:38 UTC

Created attachment 142531 [details]
card0_error 2018-11-20

kernel 4.19.1

[87496.664193] IPv6: ADDRCONF(NETDEV_CHANGE): wlp58s0: link becomes ready
[87498.501292] [drm] GPU HANG: ecode 9:0:0x0020b097, in X [2030], reason: hang on rcs0, action: reset
[87498.501302] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[87498.501307] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[87498.501311] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[87498.501315] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[87498.501320] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[87498.502367] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[87498.504230] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[87498.506822] i915 0000:00:02.0: Resetting chip for hang on rcs0
[87498.509713] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[87498.618498] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[87498.726635] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[87498.832790] i915 0000:00:02.0: Failed to reset chip
[87498.835687] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout

Comment 50 Thiago Macieira 2018-12-01 22:27:22 UTC

Created attachment 142683 [details]
card0_error 2018-12-01

4.19.2

[254508.104715] [drm] GPU HANG: ecode 9:0:0x8fdfbffe, in chrome [5113], reason: hang on rcs0, action: reset
[254508.104720] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[254508.104720] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[254508.104721] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[254508.104722] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[254508.104723] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[254508.105734] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[254508.107470] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[254508.107522] i915 0000:00:02.0: Resetting chip for hang on rcs0
[254508.110271] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[254508.216901] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[254508.325034] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[254508.431224] i915 0000:00:02.0: Failed to reset chip
[254508.434066] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout

Comment 51 Thiago Macieira 2018-12-01 22:28:53 UTC

Created attachment 142684 [details]
card0_error 2018-12-01

4.19.2

[254508.104715] [drm] GPU HANG: ecode 9:0:0x8fdfbffe, in chrome [5113], reason: hang on rcs0, action: reset
[254508.104720] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[254508.104720] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[254508.104721] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[254508.104722] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[254508.104723] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[254508.105734] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[254508.107470] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[254508.107522] i915 0000:00:02.0: Resetting chip for hang on rcs0
[254508.110271] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[254508.216901] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[254508.325034] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[254508.431224] i915 0000:00:02.0: Failed to reset chip
[254508.434066] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout

Comment 52 Thiago Macieira 2018-12-07 06:38:05 UTC

Created attachment 142743 [details]
card0_error 2018-12-06

4.19.2:

[247123.117705] [drm] GPU HANG: ecode 9:0:0x4144fc23, in kscreenlocker_g [103789], reason: hang on rcs0, action: reset
[247123.117707] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[247123.117708] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[247123.117709] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[247123.117710] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[247123.117711] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[247123.118721] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[247123.120463] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[247123.121185] i915 0000:00:02.0: Resetting chip for hang on rcs0
[247123.124833] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[247123.234729] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[247123.346730] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[247123.456988] i915 0000:00:02.0: Failed to reset chip
[247123.459741] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout

Comment 53 Romek 2018-12-07 09:40:46 UTC

Created attachment 142744 [details]
card0_error_2018-12-07_lenovo_S300

Comment 54 Romek 2018-12-07 09:41:23 UTC

I have a Lenovo S300 (Kaby Lake), kernel 4.19.4, Arch, KDE desktop + chromium. I have been getting this issue ever since I started using this laptop (a year or so ago). Crashes happen only after resuming from hibernation: sometimes immediately on logging back in, sometimes within a few minutes.

[56572.472400] [drm] GPU HANG: ecode 9:0:0x893bdd9d, in chromium [15360], reason: hang on rcs0, action: reset
[56572.472402] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[56572.472403] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[56572.472403] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[56572.472404] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[56572.472404] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[56572.473413] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[56572.475145] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[56572.475181] i915 0000:00:02.0: Resetting chip for hang on rcs0
[56572.477939] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[56572.585555] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[56572.692218] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[56572.797101] i915 0000:00:02.0: Failed to reset chip
[56572.799896] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout

Comment 55 Thiago Macieira 2018-12-07 16:40:52 UTC

Moved back from Mesa. This is not a Mesa bug.

Moved back from ASSIGNED, since clearly no one is working on this.

Comment 56 Thiago Macieira 2018-12-08 19:59:04 UTC

Created attachment 142755 [details]
card0_error 2018-12-08

4.19.2:

[49041.720961] [drm] GPU HANG: ecode 9:0:0x848acd64, in X [2032], reason: hang on rcs0, action: reset
[49041.720964] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[49041.720965] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[49041.720965] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[49041.720966] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[49041.720966] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[49041.721986] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[49041.723719] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[49041.723754] i915 0000:00:02.0: Resetting chip for hang on rcs0
[49041.726515] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[49041.833834] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[49041.941847] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[49042.048105] i915 0000:00:02.0: Failed to reset chip
[49042.050856] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout

Comment 57 Thiago Macieira 2018-12-14 03:01:36 UTC

Created attachment 142810 [details]
card0_error 2018-12-13

4.19.7:

[171009.688247] [drm] GPU HANG: ecode 9:0:0x9bfd1292, in X [2016], reason: hang on rcs0, action: reset
[171009.688253] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[171009.688256] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[171009.688258] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[171009.688261] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[171009.688264] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[171009.689304] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[171009.691076] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[171009.691155] i915 0000:00:02.0: Resetting chip for hang on rcs0
[171009.693938] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[171009.802277] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[171009.910299] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[171010.016473] i915 0000:00:02.0: Failed to reset chip
[171010.019319] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[171015.308387] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.

Comment 58 tomaik 2018-12-15 12:46:50 UTC

Possibly related to bug 108717.

Comment 59 Thiago Macieira 2019-01-10 00:36:35 UTC

Created attachment 143049 [details]
card0_error 2019-01-09

4.19.11:

[86214.433065] [drm] GPU HANG: ecode 9:0:0xb7dea192, in X [2061], reason: hang on rcs0, action: reset
[86214.433068] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[86214.433069] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[86214.433069] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[86214.433070] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[86214.433071] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[86214.434080] i915 0000:00:02.0: Resetting rcs0 for hang on rcs0
[86214.435813] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[86214.435852] i915 0000:00:02.0: Resetting chip for hang on rcs0
[86214.438619] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[86214.545725] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[86214.653728] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[86214.759982] i915 0000:00:02.0: Failed to reset chip
[86214.762731] [drm:gen8_reset_engines [i915]] *ERROR* rcs0: reset request timeout
[86216.711990] [drm] Reducing the compressed framebuffer size. This may lead to less power savings than a non-reduced-size. Try to increase stolen memory size if available in BIOS.
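
The "GPU HANG" header lines in the logs above share one layout (timestamp, ecode, process name, PID, reason, action), so they can be pulled out of a saved dmesg mechanically when comparing reports. Below is a minimal sketch; the regular expression is inferred only from the lines quoted in this report, not from the i915 source, and `parse_hang` is a hypothetical helper name.

```python
import re

# Pattern inferred from the log lines quoted in this bug report (an assumption,
# not the driver's documented format).
HANG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\] \[drm\] GPU HANG: "
    r"ecode (?P<ecode>\d+:\d+:0x[0-9a-f]+), "
    r"in (?P<comm>.+?) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>[^,]+), action: (?P<action>\w+)"
)


def parse_hang(line):
    """Return the fields of a GPU HANG line as a dict, or None if it does not match."""
    m = HANG_RE.search(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["ts"] = float(fields["ts"])   # seconds since boot, as printed by dmesg
    fields["pid"] = int(fields["pid"])
    return fields


sample = ("[56572.472400] [drm] GPU HANG: ecode 9:0:0x893bdd9d, in chromium [15360], "
          "reason: hang on rcs0, action: reset")
hang = parse_hang(sample)
print(hang["comm"], hang["pid"], hang["ecode"])
```

Lines that are not hang headers (for example the "reset request timeout" errors) simply return None, so the function can be mapped over a whole log.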

Comment 60 Romek 2019-01-19 14:34:28 UTC

I've been on kernel 4.20.3 for a few days and have not yet experienced a crash.

Comment 61 Thiago Macieira 2019-01-19 16:50:35 UTC

(In reply to Romek from comment #60)
> I've been on kernel 4.20.3 for a few days and have not yet experienced a
> crash.

Thanks, that's good to know. openSUSE Tumbleweed has 4.20.0 available now. I'll upgrade and see what happens.

Comment 62 Thiago Macieira 2019-01-30 17:03:56 UTC

(In reply to Romek from comment #60)
> I've been on kernel 4.20.3 for a few days and have not yet experienced a
> crash.

Uptime is now 8.5 days on 4.20.0, which gives good statistical confidence that it's fixed. Let's wait for 14 days, which would be unheard of on this machine.

Comment 63 Thiago Macieira 2019-02-02 06:00:56 UTC

Uptime is now 11 days on 4.20.0. Statistically speaking, this bug is fixed.
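
For anyone who wants a number behind "statistically speaking", here is a rough sketch. It takes the dmesg timestamps of the GPU HANG lines from comments 52, 56, 57 and 59 as time-from-boot-to-hang samples and asks how likely a hang-free run of a given length would be if the bug were still present. The assumptions are mine: every hang was followed by a reboot, hangs follow an exponential (memoryless) distribution, and dmesg time is comparable to uptime (the two clocks may treat suspended time differently). `survival_probability` is a hypothetical helper name.

```python
import math

# dmesg timestamps (seconds since boot) of the GPU HANG lines quoted in
# comments 52, 56, 57 and 59 (kernels 4.19.2, 4.19.7, 4.19.11).
hang_times_s = [247123.117705, 49041.720961, 171009.688247, 86214.433065]

SECONDS_PER_DAY = 86400.0

# Assumption: each value is the full time from boot to hang on the old kernels.
mean_days = sum(hang_times_s) / len(hang_times_s) / SECONDS_PER_DAY


def survival_probability(days, mean_days_to_hang):
    """P(no hang within `days`) under an exponential model with the given mean."""
    return math.exp(-days / mean_days_to_hang)


p_8_5 = survival_probability(8.5, mean_days)
p_11 = survival_probability(11.0, mean_days)

print(f"mean time to hang on 4.19.x: {mean_days:.2f} days")
print(f"P(8.5 hang-free days if unfixed): {p_8_5:.4f}")
print(f"P(11 hang-free days if unfixed):  {p_11:.4f}")
```

With only four samples this is a back-of-the-envelope estimate, but under these assumptions both probabilities come out below 1%, which fits the conclusion that the hang is gone on 4.20.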
