Bug 54226

Summary:

[snb 3.5] stale semaphore sync seqno (typically as seen on bcs->rcs)

Product:

DRI

Reporter:

mikhail.v.gavrilov

Component:

DRM/Intel

Assignee:

Chris Wilson <chris>

Status:

CLOSED WONTFIX

QA Contact:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Severity:

normal

Priority:

high

CC:

aaron.lu, admin, ajpikul, alban.crequy, alinm.elena, alpine.art.de, biovore, bogdan.nicolae, brian.drosan, brian.m.hill, bugmenot, bugsfreedesktop, bugs.freedesktop.org, bugzilla, bugzilla, caravena, cheese, chris, cobreces, cosmingiurgiu, daniel, danilo.pianini, desintegr, dhaval.giani, diego.viola, fan4326, felix+freedesktop, fermuch, fm33, foxbow, francisbrwn9, freedesktop-bugs, freedesktop-bugzilla, freedesktop, fweimer, g9772968, gennady.uraltsev+bugs, gerdesj, ghostrider, giulio.genovese, gombasg, graycham2013, hamer.mk, hobbes, hrattis, h.reindl, idokan, januszmk6, jbarnes, jefbed, jmathew6200, jonasthiem, josh, koehler, kurt, libreoffice, linuxhippy, longerdev, louie.lu, lyn.lyn, lynxis, mail, mango, Manuel.h87, mariusz.libera, martin.j.bartlett, Martin, matwey.kornilov, menchini, mezin.alexander, mika.kuoppala, mikhail.v.gavrilov, mikolaj.bugzilla, mirq-boogs, mort.yao, mroos, murks, m-widmer, naoliv, narma.nsk, neembi, nemesis, np.hardass, pavel989, peter.alfredsen, peter, post+fdo, ralf, reztho, roliverio.ve, russ.pridemore, RyanOwens, ryao, sadieperkins, samuel.rakitnican, samuel, sergeev917, s.feltman, shuber, smruti.patil, sputnick, stolowski, tarkane, theholyettlz, thuryn1, tiposchi, tobias.polzer, tomi, torkelatgenet, torvalds, vasil, yeled.nova, zephrax

Version:

unspecified

Hardware:

All

OS:

All

Whiteboard:

ReadyForDev

i915 platform:

ALL

i915 features:

Attachments:

Description	Flags
dmesg output	none
i915_error_state	none
i915_error_state (new)	none
i915_error_state (new)	none
dmesg output (new)	none
i915_error_state	none
dmesg	none
i915_error_state	none
dmesg	none
Read back semaphore mboxes after update	none
write mbox regs twice on snb	none
write mbox regs twice on snb, v2	none
kernel.spec	none
i915_error_state	none
i915_error_state (kernel 3.8 Ubuntu)	none
i915_error_state (kernel 3.7 Fedora)	none
i915_error_state (kernel 3.7 Fedora)	none
i915_error_state (kernel 3.7 Fedora)	none
i915_error_state (kernel 3.8.1 Fedora)	none
i915_error_state (kernel 3.8.1 Fedora) with path (write mbox regs twice on snb, v2)	none
i915_error_state (kernel 3.8.1 Fedora) with path (Read back semaphore mboxes after update)	none
kernel.spec	none
i915_error_state (kernel 3.8.1 Fedora) with path (Read back semaphore mboxes after update)	none
i915_error_state (kernel 3.8.1 Fedora) with path (Read back semaphore mboxes after update)	none
i915_error_state (kernel 3.8.2 Fedora) with path (write mbox regs twice on snb, v2)	none
i915_error_state (kernel 3.8.2 Fedora) with path (write mbox regs twice on snb, v2)	none
i915_error_state (kernel 3.8.2 Fedora) with path (write mbox regs twice on snb, v2)	none
[PATCH] drm/i915: Resurrect ring kicking for semaphores, selectively	none
i915_error_state (kernel 3.8.5 Fedora) with path (drm/i915: Resurrect ring kicking for semaphores, selectively)	none
dmesg (kernel 3.8.5 Fedora) with path (drm/i915: Resurrect ring kicking for semaphores, selectively)	none
i915_error_state (kernel 3.9 Fedora)	none
i915_error_state (kernel 3.9 Fedora)	none
i915_error_state - kernel 3.10-rc2, dual monitor, Dell E6430	none
i915_error_state - 3.9.2-201.rhbz879823.fc18.x86_64 (included patch write mbox regs twice on snb, v2)	none
New read-after-write patch	none
New read-after-write patch	none
New read-after-write patch	none
i915_error_state with new patch	none
i915_error_state (kernel 3.11.3)	none
i915_error_state	none
X -version output	none
i915_error_state (kernel 3.11.6, mesa 9.2.2, xf86-video-intel 2.99.906)	none
i915_error_state	none
Another version of the same hang - directed here from bug 75502	none
Kernel 3.14.2-1-ARCH, xf86-video-intel 2.99.911-2, mesa 10.1.2-1	none
card0-error.071714-cwawak - gpu dump	none
attachment-28908-0.html	none
attachment-32271-0.html	none
error state with 4.2 kernel	none
gpu error file on 4.13.5-200.fc26.x86_64	none

Description mikhail.v.gavrilov 2012-08-29 19:19:16 UTC

Created attachment 66289 [details]
dmesg output

From time to time the interface freezes, and these records appear in dmesg: [drm:i915_hangcheck_ring_idle] *ERROR* Hangcheck timer elapsed... blitter ring idle

$ lspci
00:00.0 Host bridge: Intel Corporation 2nd Generation Core Processor Family DRAM Controller (rev 09)
00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09)
00:16.0 Communication controller: Intel Corporation 6 Series/C200 Series Chipset Family MEI Controller #1 (rev 04)
00:1a.0 USB Controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 05)
00:1c.0 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 1 (rev b5)
00:1c.1 PCI bridge: Intel Corporation 82801 PCI Bridge (rev b5)
00:1c.2 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 3 (rev b5)
00:1c.3 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 4 (rev b5)
00:1c.4 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 5 (rev b5)
00:1d.0 USB Controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 05)
00:1f.0 ISA bridge: Intel Corporation H61 Express Chipset Family LPC Controller (rev 05)
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port SATA AHCI Controller (rev 05)
00:1f.3 SMBus: Intel Corporation 6 Series/C200 Series Chipset Family SMBus Controller (rev 05)
02:00.0 PCI bridge: ASMedia Technology Inc. Device 1080 (rev 01)
03:01.0 Multimedia audio controller: VIA Technologies Inc. VT1720/24 [Envy24PT/HT] PCI Multi-Channel Audio Controller (rev 01)
04:00.0 Ethernet controller: Atheros Communications AR8151 v2.0 Gigabit Ethernet (rev c0)
05:00.0 USB Controller: ASMedia Technology Inc. ASM1042 SuperSpeed USB Host Controller
06:00.0 SATA controller: ASMedia Technology Inc. Device 0612 (rev 01)

Comment 1 Chris Wilson 2012-10-21 18:10:55 UTC

If you can easily reproduce this error, can you please build a kernel using http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=xv-overlay which has some revised memory barriers.

Comment 2 mikhail.v.gavrilov 2012-10-27 08:06:45 UTC

Can you help me build an RPM for Fedora?

Comment 3 Chris Wilson 2012-11-22 09:53:50 UTC

On second thoughts, I think this should be fixed by the slight robustification in more recent hangcheck.

Please try the latest kernel for your distribution (should be 3.6.7 atm) and reopen if it still occurs.

Comment 4 mikhail.v.gavrilov 2012-11-24 13:13:12 UTC

I am using Fedora 18 with the 3.6.7-5.fc18.i686 kernel, and this message still appears in the dmesg output:
[22826.654365] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[22826.654369] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state

Comment 5 Chris Wilson 2012-11-24 14:39:24 UTC

That is not the same bug, so you need to attach a fresh set of debug info (please remember the i915_error_state)...

Comment 6 mikhail.v.gavrilov 2012-11-24 14:42:03 UTC

Please explain how to get the needed debug info. Thanks.

Comment 7 Chris Wilson 2012-11-24 14:50:07 UTC

http://intellinuxgraphics.org/how_to_report_bug.html

From which we need the i915_error_state, so

$ sudo mount -tdebugfs debug /sys/kernel/debug
$ sudo cat /sys/kernel/debug/dri/0/i915_error_state > i915_error_state

Comment 8 mikhail.v.gavrilov 2012-11-24 14:57:07 UTC

Created attachment 70518 [details]
i915_error_state

Comment 9 Chris Wilson 2012-11-24 15:09:56 UTC

That looks like it corresponds to the bug that

commit 1c8b46fc8c865189f562c9ab163d63863759712f
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Nov 14 09:15:14 2012 +0000

    drm/i915: Use LRI to update the semaphore registers
    
    The bspec was recently updated to remove the ability to update the
    semaphore using the MI_SEMAPHORE_BOX command, the ability to wait upon
    the semaphore value remained. Instead the advice is to update the
    register using the MI_LOAD_REGISTER_IMM command. In cursory testing,
    semaphores continue to function - the question is whether this fixes
    some of the deadlocks where the semaphore registers contained stale
    values?
    
hopefully addresses.

That patch is only available on drm-intel-next at the moment, which is available either at http://cgit.freedesktop.org/~danvet/drm-intel or available as drm-intel-experimental in the ubuntu kernel-ppa.

Comment 10 mikhail.v.gavrilov 2012-12-08 11:37:21 UTC

Problem repeated with patched kernel.

[118637.439016] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[118637.439020] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[mikhail@localhost ~]$ uname -a
Linux localhost.localdomain 3.6.9-4.1.fc18.i686.PAE #1 SMP Wed Dec 5 15:16:33 UTC 2012 i686 i686 i386 GNU/Linux
[mikhail@localhost ~]$ sudo cat /sys/kernel/debug/dri/0/i915_error_state > i915_error_state
[sudo] password for mikhail: 
[mikhail@localhost ~]$

Comment 11 mikhail.v.gavrilov 2012-12-08 11:38:52 UTC

Created attachment 71192 [details]
i915_error_state (new)

Comment 12 mikhail.v.gavrilov 2012-12-08 13:36:15 UTC

sudo cat /sys/kernel/debug/dri/0/i915_error_state > i915_error_state-8
cat: /sys/kernel/debug/dri/0/i915_error_state: Cannot allocate memory


What does this mean?

Comment 13 mikhail.v.gavrilov 2012-12-08 13:37:34 UTC

Created attachment 71199 [details]
i915_error_state (new)

Comment 14 mikhail.v.gavrilov 2012-12-08 14:07:49 UTC

Created attachment 71200 [details]
dmesg output (new)

Comment 15 Chris Wilson 2012-12-08 17:22:52 UTC

Lalalalala.

Comment 16 Chris Wilson 2012-12-09 21:25:18 UTC

*** Bug 58057 has been marked as a duplicate of this bug. ***

Comment 17 Chris Wilson 2012-12-12 21:34:41 UTC

*** Bug 58212 has been marked as a duplicate of this bug. ***

Comment 18 Chris Wilson 2012-12-13 08:30:44 UTC

We can confirm the synopsis by disabling semaphores (i915.semaphores=0), but can we also test whether this is an rc6 side-effect (i915.i915_enable_rc6=0)?

Comment 19 Chris Wilson 2012-12-13 08:35:07 UTC

Also maybe time for 'git revert 4e0e90dcb8a7df1229c69e30abebb59b0b3c2a1f'

Comment 20 mikhail.v.gavrilov 2012-12-15 14:20:12 UTC

Created attachment 71549 [details]
i915_error_state

Comment 21 mikhail.v.gavrilov 2012-12-15 14:21:59 UTC

Created attachment 71550 [details]
dmesg

Comment 22 mikhail.v.gavrilov 2012-12-17 07:24:13 UTC

Created attachment 71629 [details]
i915_error_state

Comment 23 mikhail.v.gavrilov 2012-12-17 07:24:33 UTC

Created attachment 71630 [details]
dmesg

Comment 24 Chris Wilson 2012-12-30 10:28:48 UTC

Mikhail, for the time being you can set i915.semaphores=0 (or echo 0 > /sys/module/i915/parameters/semaphores) to prevent this hang.

The only interesting patch I can suggest atm is

commit 31643d54a739382626c27c0f2a12b3bbc22d1a38
Author: Ben Widawsky <ben@bwidawsk.net>
Date:   Wed Sep 26 10:34:01 2012 -0700

    drm/i915: Workaround to bump rc6 voltage to 450
    
    BIOS should be setting the minimum voltage for rc6 to be 450mV. Old or
    buggy BIOSen may not be doing this, so we correct it for them. Ideally
    customers should update the BIOS as only it would know the optimal
    values for the platform, so we leave that fact as a DRM_ERROR for the
    user to see.

in 3.8-rc1 or look for a BIOS update.

Comment 25 Chris Wilson 2013-01-03 16:00:37 UTC

*** Bug 58986 has been marked as a duplicate of this bug. ***

Comment 26 Chris Wilson 2013-01-10 01:10:49 UTC

Created attachment 72766 [details] [review]
Read back semaphore mboxes after update

Can you please try this patch, enable semaphores and see if the bug persists?

Comment 27 mikhail.v.gavrilov 2013-01-10 01:42:16 UTC

(In reply to comment #24)
> Mikhail, for the time being you can set i915.semaphores=0 (or echo 0 >
> /sys/module/i915/parameters/semaphores) to prevent this hang.

What are the consequences?

> The only interesting patch I can suggest atm is
> 
> commit 31643d54a739382626c27c0f2a12b3bbc22d1a38
> Author: Ben Widawsky <ben@bwidawsk.net>
> Date:   Wed Sep 26 10:34:01 2012 -0700
> 
>     drm/i915: Workaround to bump rc6 voltage to 450
>     
>     BIOS should be setting the minimum voltage for rc6 to be 450mV. Old or
>     buggy BIOSen may not be doing this, so we correct it for them. Ideally
>     customers should update the BIOS as only it would know the optimal
>     values for the platform, so we leave that fact as a DRM_ERROR for the
>     user to see.
> 
> in 3.8-rc1 or look for a BIOS update.

I have an H61M/U3S3 motherboard with the latest BIOS, version 2.20 from 8/15/2012:
ftp://174.142.97.10/bios/1155/H61MU3S3(2.20)ROM.zip
How can I check whether the problem persists?

Comment 28 Chris Wilson 2013-01-10 02:30:37 UTC

(In reply to comment #27)
> (In reply to comment #24)
> > Mikhail, for the time being you can set i915.semaphores=0 (or echo 0 >
> > /sys/module/i915/parameters/semaphores) to prevent this hang.
> 
> What are the consequences?

Rendering throughput drops by 10% with SNA, or by as much as 3x with UXA. OpenGL performance is likely to be reduced by about 30%. More CPU time is spent waiting for the GPU with rc6 disabled, so power consumption increases.

Comment 29 Ben Widawsky 2013-01-20 20:07:03 UTC

(In reply to comment #27)

> > The only interesting patch I can suggest atm is
> > 
> > commit 31643d54a739382626c27c0f2a12b3bbc22d1a38
> > Author: Ben Widawsky <ben@bwidawsk.net>
> > Date:   Wed Sep 26 10:34:01 2012 -0700
> > 
> >     drm/i915: Workaround to bump rc6 voltage to 450
> >     
> >     BIOS should be setting the minimum voltage for rc6 to be 450mV. Old or
> >     buggy BIOSen may not be doing this, so we correct it for them. Ideally
> >     customers should update the BIOS as only it would know the optimal
> >     values for the platform, so we leave that fact as a DRM_ERROR for the
> >     user to see.
> > 
> > in 3.8-rc1 or look for a BIOS update.
> 
> I have an H61M/U3S3 motherboard with the latest BIOS, version 2.20 from 8/15/2012:
> ftp://174.142.97.10/bios/1155/H61MU3S3(2.20)ROM.zip
> How can I check whether the problem persists?

The easiest way is to apply the patch and look for DRM_DEBUG_DRIVER messages. This is unlikely to fix the problem, but also can't hurt.

We've only assumed new BIOS will fix the problem, but who knows. Especially if it's a 3rd party BIOS.

Comment 30 Chris Wilson 2013-01-24 10:33:59 UTC

*** Bug 59786 has been marked as a duplicate of this bug. ***

Comment 31 Daniel Vetter 2013-01-24 11:07:56 UTC

Created attachment 73560 [details] [review]
write mbox regs twice on snb

Another piece of magic which might help. Please test this patch and the one from Chris ("Read back semaphore mboxes after update") separately and report back whether anything changes.

Comment 32 Daniel Vetter 2013-01-24 13:21:57 UTC

Created attachment 73577 [details] [review]
write mbox regs twice on snb, v2

Now actually the right patch attached, the old one didn't compile ...

Comment 33 mikhail.v.gavrilov 2013-01-30 21:01:58 UTC

Which patch do I need to apply to fix this issue?

I see that the patches from comment 26 and comment 32 have similar logic...

@@ -596,6 +606,16 @@ gen6_add_request(struct intel_ring_buffer *ring)
 	intel_ring_emit(ring, MI_USER_INTERRUPT);
 	intel_ring_advance(ring);
 
+	if (IS_GEN6(ring->dev)) {
+		ret = intel_ring_begin(ring, 6);
+		if (ret)
+			return ret;
+
+		read_mboxes(ring, mbox1_reg, 1024);
+		read_mboxes(ring, mbox2_reg, 1028);
+		intel_ring_advance(ring);
+	}
+
 	return 0;
 }

@@ -598,6 +598,19 @@ gen6_add_request(struct intel_ring_buffer *ring)
 	intel_ring_emit(ring, MI_USER_INTERRUPT);
 	intel_ring_advance(ring);
 
+	if (IS_GEN6(ring->dev)) {
+		ret = intel_ring_begin(ring, 6);
+		if (ret)
+			return ret;
+
+		mbox1_reg = ring->signal_mbox[0];
+		mbox2_reg = ring->signal_mbox[1];
+
+		update_mboxes(ring, mbox1_reg);
+		update_mboxes(ring, mbox2_reg);
+		intel_ring_advance(ring);
+	}
+
 	return 0;
 }

Comment 34 Daniel Vetter 2013-01-30 21:37:10 UTC

> --- Comment #33 from mikhail.v.gavrilov@gmail.com ---
> Which patch do I need to apply to fix this issue?

We can't reproduce the bug, so those are just patches to test
different ideas. Please test each of them individually (i.e. remove
the first before testing the 2nd patch) and then report whether
anything changes (i.e. harder or easier for you to hit the issue).

Comment 35 mikhail.v.gavrilov 2013-02-02 14:25:05 UTC

I can't compile the kernel with the patch above:

drivers/gpu/drm/i915/intel_ringbuffer.c: In function 'gen6_add_request':
drivers/gpu/drm/i915/intel_ringbuffer.c:611:3: error: too few arguments to function 'update_mboxes'
drivers/gpu/drm/i915/intel_ringbuffer.c:557:1: note: declared here
drivers/gpu/drm/i915/intel_ringbuffer.c:612:3: error: too few arguments to function 'update_mboxes'
drivers/gpu/drm/i915/intel_ringbuffer.c:557:1: note: declared here
make[4]: *** [drivers/gpu/drm/i915/intel_ringbuffer.o] Error 1
make[3]: *** [drivers/gpu/drm/i915] Error 2
make[2]: *** [drivers/gpu/drm] Error 2
make[1]: *** [drivers/gpu] Error 2
make[1]: *** Waiting for unfinished jobs....
make: *** [drivers] Error 2
make: *** Waiting for unfinished jobs....

Comment 36 mikhail.v.gavrilov 2013-02-02 14:25:54 UTC

Created attachment 74087 [details]
kernel.spec

Comment 37 mikhail.v.gavrilov 2013-02-10 18:49:10 UTC

Created attachment 74561 [details]
i915_error_state

Comment 38 mikhail.v.gavrilov 2013-02-10 20:26:14 UTC

Created attachment 74566 [details]
i915_error_state (kernel 3.8 Ubuntu)

Comment 39 mikhail.v.gavrilov 2013-02-13 19:28:34 UTC

Created attachment 74779 [details]
i915_error_state (kernel 3.7 Fedora)

Comment 40 mikhail.v.gavrilov 2013-02-13 20:05:33 UTC

Created attachment 74781 [details]
i915_error_state (kernel 3.7 Fedora)

Comment 41 mikhail.v.gavrilov 2013-02-15 03:22:32 UTC

Created attachment 74850 [details]
i915_error_state (kernel 3.7 Fedora)

Comment 42 Norman Yarvin 2013-02-20 05:00:01 UTC

I'm seeing this bug, or something like it, on an older chip (G965, desktop version):

Feb 19 22:05:56 muttonhead kernel: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Feb 19 22:05:56 muttonhead kernel: [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
Feb 19 22:05:56 muttonhead kernel: [drm:kick_ring] *ERROR* Kicking stuck wait on render ring
Feb 19 22:05:57 muttonhead kernel: [drm:i915_reset] *ERROR* Failed to reset chip.

after which the mouse pointer sticks in one spot (with most other things working), and then when I shut down X, the console fails to appear, requiring a reboot.  Not knowing that the given file path was under /sys/kernel, I failed to capture the error state, but will do so next time this happens (which is maybe every other day).  This is with a 3.7 kernel (Gentoo); before 3.7, the driver was stable.  I don't know what the 'generation' numbers in the driver mean, but I'm guessing that generation 6 is later, so many of the suggested fixes would not make any difference on this machine.

Comment 43 Chris Wilson 2013-02-20 09:13:34 UTC

(In reply to comment #42)
> I'm seeing this bug, or something like it, on an older chip (G965, desktop
> version):

Good news, it is not this bug. Please make sure you have the latest stable driver (a gentoo user not using 3.8 already! ;-) and latest xf86-video-intel, then file a fresh bug report, attaching your dmesg, Xorg.0.log and i915_error_state.

Comment 44 Alberto González 2013-02-20 14:04:23 UTC

I subscribed to this bug because I was seeing this hang too. It happened randomly several times, without a specific cause or way to reproduce it.

This was around December, and it happened maybe 4-5 times along a month. The GPU would hang with that error in dmesg, and everything continued to work, though very slowly.

However, I must say that since then it didn't happen again for almost 2 months maybe. I use Arch Linux, which means I always update to the latest stable packages of everything, so it seems that for me it got solved at some point (or at least much harder to reproduce).

This is an Ironlake / HD 2000 based Dell laptop. I did update the BIOS when I found this bug report, but it didn't solve the problem, the hang happened after updating it.

Comment 45 Chris Wilson 2013-02-22 22:14:55 UTC

*** Bug 61310 has been marked as a duplicate of this bug. ***

Comment 46 mikhail.v.gavrilov 2013-03-03 07:05:37 UTC

Created attachment 75818 [details]
i915_error_state (kernel 3.8.1 Fedora)

Comment 47 mikhail.v.gavrilov 2013-03-03 07:07:04 UTC

Today Fedora 18 updated the kernel to 3.8.1 and the message "[drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung" is still here. Please look at my latest log. Any updates?

Comment 48 Ben Widawsky 2013-03-06 02:29:32 UTC

This looks weird to me:

0x00005a58:      0x11000001: MI_LOAD_REGISTER_IMM
0x00005a5c:      0x00012044:    dword 1
0x00005a60:      0x0043b625:    dword 2
0x00005a64:      0x11000001: MI_LOAD_REGISTER_IMM
0x00005a68:      0x00022040:    dword 1
0x00005a6c:      0x0043b625:    dword 2
0x00005a70:      0x10800001: MI_STORE_DATA_INDEX
0x00005a74:      0x00000080:    index
0x00005a78:      0x0043b625:    dword
0x00005a7c:      0x01000000: MI_USER_INTERRUPT
0x00005a80:      0x0b160001: MI_SEMAPHORE_MBOX compare semaphore, use compare reg 2
0x00005a84:      0x0043b625:    value
0x00005a88:      0x00000000:    address
0x00005a8c:      0x00000000: MI_NOOP


Chris?

Comment 49 Chris Wilson 2013-03-06 09:03:14 UTC

Weird? Did you just forget that the hw does a strictly greater-than comparison?
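
Editorial illustration of the point above: if the hardware releases a waiter only when the mailbox value is strictly greater than the value carried by MI_SEMAPHORE_MBOX, then a wait for a given seqno has to carry seqno - 1, which is why a wait value that matches a recently signalled seqno is not in itself suspicious. The sketch below is a minimal model under that assumption. The function names are hypothetical (they are not driver functions), and seqno wraparound is deliberately ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Model of the comparison described in the comment above: the waiter
 * proceeds only when the mailbox holds a value strictly greater than
 * the value emitted with the wait command.
 * Assumption: plain unsigned compare, no wraparound handling.
 */
static bool semaphore_wait_passes(uint32_t mbox, uint32_t wait_value)
{
	return mbox > wait_value;
}

/*
 * Hypothetical helper: the value a waiter would emit so that it is
 * released as soon as `seqno` has been written to the mailbox.
 */
static uint32_t wait_value_for_seqno(uint32_t seqno)
{
	assert(seqno != 0); /* seqno 0 has no predecessor in this model */
	return seqno - 1;
}
```

With the values from the dump in comment 48, a wait carrying 0x0043b625 stays blocked while the mailbox holds 0x0043b625 and is released once it holds 0x0043b626.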

Comment 50 Chris Wilson 2013-03-06 09:04:19 UTC

(In reply to comment #47)
> Today Fedora 18 updated the kernel to 3.8.1 and the message
> "[drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung" is
> still here. Please look at my latest log. Any updates?

We're still waiting for you to apply the patches and report back.

Comment 51 Daniel Vetter 2013-03-06 22:51:02 UTC

*** Bug 61925 has been marked as a duplicate of this bug. ***

Comment 52 mikhail.v.gavrilov 2013-03-08 22:20:09 UTC

Created attachment 76196 [details]
i915_error_state (kernel 3.8.1 Fedora) with path (write mbox regs twice on snb, v2)

I applied the patch "write mbox regs twice on snb, v2" but still have the problem: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung

Comment 53 mikhail.v.gavrilov 2013-03-09 08:16:24 UTC

Created attachment 76208 [details]
i915_error_state (kernel 3.8.1 Fedora) with path (Read back semaphore mboxes after update)

I also applied the patch "Read back semaphore mboxes after update" but still have the problem: [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung

Comment 54 Chris Wilson 2013-03-09 10:24:13 UTC

(In reply to comment #52)
> Created attachment 76196 [details]
> i915_error_state (kernel 3.8.1 Fedora) with path (write mbox regs twice on
> snb, v2)
> 
> I applied the patch "write mbox regs twice on snb, v2" but still have the problem:
> [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung

0x00052cc8:      0x18800100: MI_BATCH_BUFFER_START
0x00052ccc:      0x0d59b000:    dword 1
0x00052cd0:      0x13000001: MI_FLUSH_DW post_sync_op='no write' 
0x00052cd4:      0x000000c4:    address
0x00052cd8:      0x00000000:    dword
0x00052cdc:      0x00000000: MI_NOOP
0x00052ce0:      0x11000001: MI_LOAD_REGISTER_IMM
0x00052ce4:      0x00002044:    dword 1
0x00052ce8:      0x0007a582:    dword 2
0x00052cec:      0x11000001: MI_LOAD_REGISTER_IMM
0x00052cf0:      0x00012040:    dword 1
0x00052cf4:      0x0007a582:    dword 2
0x00052cf8:      0x10800001: MI_STORE_DATA_INDEX
0x00052cfc:      0x00000080:    index
0x00052d00:      0x0007a582:    dword
0x00052d04:      0x01000000: MI_USER_INTERRUPT

That's only a single LRI per semaphore, the patch wasn't tested.
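
Editorial illustration of the check described above: one way to tell whether a "write twice" style patch was in the running kernel is to count how often each mailbox register is loaded by MI_LOAD_REGISTER_IMM in the ring tail. The sketch below is a simplified scanner, not the decoder that produced the dump. The function and array names are hypothetical, the dwords and register offsets (0x2044, 0x12040) are copied from the dump in this comment, and the scan naively treats any dword equal to 0x11000001 as an LRI header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Header of a single-register LRI, as it appears in the dump above. */
#define MI_LOAD_REGISTER_IMM_1 0x11000001u

/* Ring tail dwords copied from the decoded dump in this comment. */
static const uint32_t ring_tail[] = {
	0x18800100u, 0x0d59b000u,              /* MI_BATCH_BUFFER_START */
	0x13000001u, 0x000000c4u, 0x00000000u, /* MI_FLUSH_DW */
	0x00000000u,                           /* MI_NOOP */
	0x11000001u, 0x00002044u, 0x0007a582u, /* LRI to 0x2044 */
	0x11000001u, 0x00012040u, 0x0007a582u, /* LRI to 0x12040 */
	0x10800001u, 0x00000080u, 0x0007a582u, /* MI_STORE_DATA_INDEX */
	0x01000000u,                           /* MI_USER_INTERRUPT */
};

/*
 * Count single-register LRI commands that target `reg`.
 * Simplification: commands are not fully decoded, so a data dword that
 * happens to equal the LRI header would be misread as a command.
 */
static unsigned count_lri_writes(const uint32_t *dw, size_t n, uint32_t reg)
{
	unsigned count = 0;
	size_t i = 0;

	assert(dw != NULL || n == 0);
	while (i + 2 < n) {
		if (dw[i] == MI_LOAD_REGISTER_IMM_1) {
			if (dw[i + 1] == reg)
				count++;
			i += 3; /* skip header, register offset, value */
		} else {
			i++;
		}
	}
	return count;
}
```

For the dwords above, both 0x2044 and 0x12040 are counted once, which matches the observation that there is only a single LRI per semaphore; a kernel that really wrote each mailbox twice would presumably show a count of two.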

Comment 55 Chris Wilson 2013-03-09 10:25:44 UTC

I would say '3.8.1-203.fc18.i686.PAE' was the distro kernel and not your patched version.

Comment 56 mikhail.v.gavrilov 2013-03-09 11:56:27 UTC

Created attachment 76215 [details]
kernel.spec

(In reply to comment #55)
> I would say '3.8.1-203.fc18.i686.PAE' was the distro kernel and not your
> patched version.

That's impossible. The distro kernel is 3.8.1-201.fc18.i686.PAE; 3.8.1-202.fc18.i686.PAE and 3.8.1-203.fc18.i686.PAE are kernels patched by me.

You can verify this by looking at my build spec file.

Comment 57 mikhail.v.gavrilov 2013-03-09 19:31:22 UTC

Created attachment 76239 [details]
i915_error_state (kernel 3.8.1 Fedora) with path (Read back semaphore mboxes after update)

I am sorry. It seems I forgot to add "ApplyPatch" to the spec. I rebuilt the kernel with the "0001-drm-i915-Read-back-semaphore-mboxes-after-updating-t.patch" patch, but the problem still seems to be here.

Does it make sense to check the "0001-write-mbox-regs-twice-on-gen6.patch" patch?

Comment 58 mikhail.v.gavrilov 2013-03-09 21:19:24 UTC

Created attachment 76243 [details]
i915_error_state (kernel 3.8.1 Fedora) with path (Read back semaphore mboxes after update)

Comment 59 mikhail.v.gavrilov 2013-03-10 09:05:05 UTC

Created attachment 76261 [details]
i915_error_state (kernel 3.8.2 Fedora) with path (write mbox regs twice on snb, v2)

The "write mbox regs twice on snb, v2" patch also does not solve the problem.

[ 1399.270341] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[ 1399.270345] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[ 1399.277331] [drm:__gen6_gt_force_wake_get] *ERROR* Timed out waiting for forcewake old ack to clear.

Comment 60 mikhail.v.gavrilov 2013-03-10 21:47:45 UTC

Created attachment 76293 [details]
i915_error_state (kernel 3.8.2 Fedora) with path (write mbox regs twice on snb, v2)

Comment 61 mikhail.v.gavrilov 2013-03-12 21:49:44 UTC

Created attachment 76448 [details]
i915_error_state (kernel 3.8.2 Fedora) with path (write mbox regs twice on snb, v2)

Any updates?

Comment 62 Chris Wilson 2013-03-17 18:49:52 UTC

*** Bug 62443 has been marked as a duplicate of this bug. ***

Comment 63 Chris Wilson 2013-03-19 08:59:38 UTC

As a workaround, this

commit a24a11e6b4e96bca817f854e0ffcce75d3eddd13
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Mar 14 17:52:05 2013 +0200

    drm/i915: Resurrect ring kicking for semaphores, selectively

should improve the recovery from the hangs.

Comment 64 clemens.brunner 2013-03-20 10:18:36 UTC

OK, I've been experiencing this bug from time to time on my Arch Linux box. No apparent reason, last time it happened I was watching a Youtube video, and it also seems to happen more often when I'm running VirtualBox. However, this might just be a coincidence.

Comment 65 LongerDev 2013-03-31 09:36:10 UTC

I have this bug too.

Gentoo 64bit
00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09) (prog-if 00 [VGA controller])
        Subsystem: Samsung Electronics Co Ltd Device c0a0
        Flags: bus master, fast devsel, latency 0, IRQ 16
        Memory at f5c00000 (64-bit, non-prefetchable) [size=4M]
        Memory at e0000000 (64-bit, prefetchable) [size=256M]
        I/O ports at e000 [size=64]
        Expansion ROM at <unassigned> [disabled]
        Capabilities: <access denied>
        Kernel driver in use: i915

Kernel 3.8.0 gentoo-sources

I tried patch a24a11e6b4e96bca817f854e0ffcce75d3eddd13, but nothing changed.
Mar 31 15:14:37 localhost kernel: [64379.291736] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
Mar 31 15:14:37 localhost kernel: [64379.291742] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state

Comment 66 Mika Kuoppala 2013-04-05 12:52:57 UTC

Created attachment 77475 [details] [review]
[PATCH] drm/i915: Resurrect ring kicking for semaphores, selectively

Comment 67 Mika Kuoppala 2013-04-05 12:55:10 UTC

(In reply to comment #61)
> Created attachment 76448 [details]
> i915_error_state (kernel 3.8.2 Fedora) with path (write mbox regs twice on
> snb, v2)
> 
> Any updates?

Mikhail,

Could you please try patch:
[PATCH] drm/i915: Resurrect ring kicking for semaphores, selectively

Comment 68 Daniel Vetter 2013-04-05 19:03:41 UTC

Patch is also included in latest drm-intel-nightly, linux-next. So you can test it by grabbing a distro-build of one of those.

Comment 69 mikhail.v.gavrilov 2013-04-09 20:59:41 UTC

(In reply to comment #67)
> (In reply to comment #61)
> > Created attachment 76448 [details]
> > i915_error_state (kernel 3.8.2 Fedora) with path (write mbox regs twice on
> > snb, v2)
> > 
> > Any updates?
> 
> Mikhail,
> 
> Could you please try patch:
> [PATCH] drm/i915: Resurrect ring kicking for semaphores, selectively

Hm, it seems better, but the problem is still here

[59120.008798] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[59120.008802] [drm] capturing error event; look for more information in /debug/dri/0/i915_error_state
[59120.012173] [drm:kick_ring] *ERROR* Kicking stuck semaphore on render ring

Comment 70 mikhail.v.gavrilov 2013-04-09 21:01:13 UTC

Created attachment 77692 [details]
i915_error_state (kernel 3.8.5 Fedora) with path (drm/i915: Resurrect ring kicking for semaphores, selectively)

Comment 71 mikhail.v.gavrilov 2013-04-09 21:02:28 UTC

Created attachment 77693 [details]
dmesg (kernel 3.8.5 Fedora) with path (drm/i915: Resurrect ring kicking for semaphores, selectively)

Comment 72 Chris Wilson 2013-04-09 21:08:44 UTC

\o/ It kicked the right ring.

Comment 73 mikhail.v.gavrilov 2013-04-09 21:16:56 UTC

(In reply to comment #72)
> \o/ It kicked the right ring.

So is this normal?

Comment 74 Chris Wilson 2013-04-09 22:13:53 UTC

It's the expected 'improved' recovery behaviour for this bug.

Comment 75 Chris Wilson 2013-04-15 09:35:58 UTC

*** Bug 63542 has been marked as a duplicate of this bug. ***

Comment 76 Bryce Harrington 2013-04-22 23:55:08 UTC

Chris, what is the upstream status for the ring kicker patch?  Is that likely to get incorporated upstream, or do you feel it needs further polish before it's ready?  Would this patch incur some risk of regressions in other areas were it to be backported for inclusion in Ubuntu?

Comment 77 Daniel Vetter 2013-04-23 14:57:39 UTC

(In reply to comment #76)
> Chris, what is the upstream status for the ring kicker patch?  Is that
> likely to get incorporated upstream, or do you feel it needs further polish
> before it's ready?  Would this patch incur some risk of regressions in other
> areas were it to be backported for inclusion in Ubuntu?

Merged for 3.10 as

commit a24a11e6b4e96bca817f854e0ffcce75d3eddd13
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Mar 14 17:52:05 2013 +0200

    drm/i915: Resurrect ring kicking for semaphores, selectively

Nothing else planned for now, but I think we can just keep this bug here open in case we stumble across a new idea. And it seems to be good honey to attract all the me-too reports ;-)

Comment 78 Tom Wijsman 2013-04-23 16:37:43 UTC

(In reply to comment #65)
> Kernel 3.8.0 gentoo-sources

Did you report this at the Gentoo Bugzilla?

When you do, please attach /debug/dri/0/i915_error_state

Comment 79 LongerDev 2013-04-29 12:13:16 UTC

>Did you report this at the Gentoo Bugzilla?

>When you do, please attach /debug/dri/0/i915_error_state

I have not reported it in the Gentoo Bugzilla yet (their kernel does not carry patches for the Intel drivers). With this patch, I have not been able to reproduce the bug for 2 weeks on kernel 3.9-rc6. However, I have not tested with Blender (when I tried to use Blender, the GPU hang recurred within 1-5 minutes).

Comment 80 Chris Wilson 2013-05-01 06:44:05 UTC

*** Bug 64094 has been marked as a duplicate of this bug. ***

Comment 81 mikhail.v.gavrilov 2013-05-01 07:04:06 UTC

Created attachment 78692 [details]
i915_error_state (kernel 3.9 Fedora)

Comment 82 mikhail.v.gavrilov 2013-05-01 07:04:33 UTC

Created attachment 78693 [details]
i915_error_state (kernel 3.9 Fedora)

Comment 83 Chris Wilson 2013-05-07 07:42:11 UTC

*** Bug 64094 has been marked as a duplicate of this bug. ***

Comment 84 Wojciech Kuranowski 2013-05-23 12:46:34 UTC

Created attachment 79704 [details]
i915_error_state - kernel 3.10-rc2, dual monitor, Dell E6430

I can reproduce this bug every time I try to quickly drag a Chrome window with a YouTube movie to a secondary monitor connected to my laptop Dell E6430. It is very annoying. Tested on latest kernel 3.10-rc2.

I can give you any additional information you want, test patches, etc. Just please try to fix this :)

Comment 85 Wojciech Kuranowski 2013-05-23 12:55:27 UTC

(In reply to comment #84)
> Created attachment 79704 [details]
> i915_error_state - kernel 3.10-rc2, dual monitor, Dell E6430
> 
> I can reproduce this bug every time I try to quickly drag a Chrome window
> with a YouTube movie to a secondary monitor connected to my laptop Dell
> E6430.

One more piece of information: you need to enable "Override software rendering list" in chrome://flags

Comment 86 Christopher Wawak 2013-05-29 17:54:42 UTC

Created attachment 79979 [details]
i915_error_state - 3.9.2-201.rhbz879823.fc18.x86_64 (included patch write mbox regs twice on snb, v2)

Linux bobloblaw 3.9.2-201.rhbz879823.fc18.x86_64 #1 SMP Thu May 16 13:35:12 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

[45482.757631] [drm:i915_hangcheck_hung] *ERROR* Hangcheck timer elapsed... GPU hung
[45482.757645] [drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_state
[45482.766942] [drm:kick_ring] *ERROR* Kicking stuck semaphore on render ring
[45482.770617] [drm:__gen6_gt_force_wake_get] *ERROR* Timed out waiting for forcewake old ack to clear.

I added the patch (drm/i915: Resurrect ring kicking for semaphores, selectively) to Fedora 18's 3.9.2-200 x86_64 kernel.

Comment 87 Christopher Wawak 2013-06-30 20:28:46 UTC

Is there any input or assistance I can give to help move this along? 

Thanks!

Comment 88 Chris Wilson 2013-07-20 21:33:52 UTC

Created attachment 82747 [details] [review]
New read-after-write patch

New patch for testing, thanks!

Comment 89 Chris Wilson 2013-07-20 21:35:45 UTC

Created attachment 82748 [details] [review]
New read-after-write patch

Comment 90 mikhail.v.gavrilov 2013-07-20 21:46:24 UTC

Which kernel version is this patch for?

Comment 91 LongerDev 2013-07-21 10:49:00 UTC

I tried this patch on linux-3.11_rc1, but when X starts I see:
Jul 21 16:17:07 localhost kernel: [   19.320879] BUG: unable to handle kernel NULL pointer dereference at 0000000000000010
Jul 21 16:17:07 localhost kernel: [   19.320948] IP: [<ffffffff8136bfc0>] gen6_add_request+0xe7/0x178
Jul 21 16:17:07 localhost kernel: [   19.320995] PGD b0d80067 PUD b0c18067 PMD 0
Jul 21 16:17:07 localhost kernel: [   19.321031] Oops: 0000 [#1] PREEMPT SMP
Jul 21 16:17:07 localhost kernel: [   19.321064] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_intel snd_hda_codec brcmsmac snd_hwdep snd_pcm cordic brcmutil bcma snd_page_alloc snd_timer snd soundcore
Jul 21 16:17:07 localhost kernel: [   19.321209] CPU: 0 PID: 2696 Comm: X Not tainted 3.11.0-rc1 #1
Jul 21 16:17:07 localhost kernel: [   19.321249] Hardware name: SAMSUNG ELECTRONICS CO., LTD. SF311/SF411/SF511/SF311/SF411/SF511, BIOS 06HW.M011.20110503.SCY 05/03/2011
Jul 21 16:17:07 localhost kernel: [   19.321322] task: ffff8800b1c07590 ti: ffff8800b0c24000 task.ti: ffff8800b0c24000
Jul 21 16:17:07 localhost kernel: [   19.321370] RIP: 0010:[<ffffffff8136bfc0>]  [<ffffffff8136bfc0>] gen6_add_request+0xe7/0x178
Jul 21 16:17:07 localhost kernel: [   19.321426] RSP: 0018:ffff8800b0c25bc8  EFLAGS: 00010286
Jul 21 16:17:07 localhost kernel: [   19.321461] RAX: 0000000000000000 RBX: ffff8800b1c3d4d8 RCX: 0000000000027330
Jul 21 16:17:07 localhost kernel: [   19.321506] RDX: 0000000000000080 RSI: ffffc900045c003c RDI: ffffc900045c0038
Jul 21 16:17:07 localhost kernel: [   19.321550] RBP: ffff8800b0c25c08 R08: ffff8800b0d97f00 R09: 00000000000145c0
Jul 21 16:17:07 localhost kernel: [   19.321594] R10: 0000000000001000 R11: ffff8800b1c3c000 R12: 0000000000000000
Jul 21 16:17:07 localhost kernel: [   19.321638] R13: 0000000000002044 R14: 0000000000000000 R15: ffff8800b1c3c000
Jul 21 16:17:07 localhost kernel: [   19.321682] FS:  00007ff167ae8880(0000) GS:ffff880100200000(0000) knlGS:0000000000000000
Jul 21 16:17:07 localhost kernel: [   19.321732] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 21 16:17:07 localhost kernel: [   19.321767] CR2: 0000000000000010 CR3: 00000000b1cc9000 CR4: 00000000000407f0
Jul 21 16:17:07 localhost kernel: [   19.321810] Stack:
Jul 21 16:17:07 localhost kernel: [   19.321824]  ffff8800b1c3d4d8 0000000000000000 ffff8800aff24000 0000000000000000
Jul 21 16:17:07 localhost kernel: [   19.321876]  ffff8800b1c3c000 ffff8800b0d97f00 ffff8800b1f66a00 ffff8800b1c3d4d8
Jul 21 16:17:07 localhost kernel: [   19.321927]  ffff8800b0c25c68 ffffffff81334b11 ffff880000000028 0000000000000000
Jul 21 16:17:07 localhost kernel: [   19.321979] Call Trace:
Jul 21 16:17:07 localhost kernel: [   19.322000]  [<ffffffff81334b11>] __i915_add_request+0x6d/0x215
Jul 21 16:17:07 localhost kernel: [   19.322045]  [<ffffffff8133b8d9>] i915_gem_do_execbuffer.isra.14+0xd07/0xdc5
Jul 21 16:17:07 localhost kernel: [   19.322089]  [<ffffffff8133bd5e>] ? i915_gem_execbuffer2+0x5d/0x1e3
Jul 21 16:17:07 localhost kernel: [   19.322128]  [<ffffffff8133be5a>] i915_gem_execbuffer2+0x159/0x1e3
Jul 21 16:17:07 localhost kernel: [   19.322170]  [<ffffffff8130e167>] drm_ioctl+0x302/0x446
Jul 21 16:17:07 localhost kernel: [   19.322204]  [<ffffffff8133bd01>] ? i915_gem_execbuffer+0x36a/0x36a
Jul 21 16:17:07 localhost kernel: [   19.322245]  [<ffffffff8102a823>] ? __do_page_fault+0x34f/0x3f3
Jul 21 16:17:07 localhost kernel: [   19.322285]  [<ffffffff810d3621>] vfs_ioctl+0x21/0x34
Jul 21 16:17:07 localhost kernel: [   19.322317]  [<ffffffff810d3e7a>] do_vfs_ioctl+0x3b8/0x3fb
Jul 21 16:17:07 localhost kernel: [   19.322353]  [<ffffffff810dbab9>] ? fget_light+0xa1/0xb8
Jul 21 16:17:07 localhost kernel: [   19.322387]  [<ffffffff810d3efd>] SyS_ioctl+0x40/0x6b
Jul 21 16:17:07 localhost kernel: [   19.322420]  [<ffffffff816450d2>] system_call_fastpath+0x16/0x1b
Jul 21 16:17:07 localhost kernel: [   19.322457] Code: e8 d4 c0 f0 ff 8b 73 2c 44 89 ef 83 c6 04 89 73 2c 48 03 73 10 e8 bf c0 f0 ff 8b 73 2c 48 8b 45 c8 83 c6 04 89 73 2c 48 03 73 10 <8b> 78 10 83 ef 80 e8 a3 c0 f0 ff 83 43 2c 04 49 ff c4 49 83 fc
Jul 21 16:17:07 localhost kernel: [   19.322688] RIP  [<ffffffff8136bfc0>] gen6_add_request+0xe7/0x178
Jul 21 16:17:07 localhost kernel: [   19.322728]  RSP <ffff8800b0c25bc8>
Jul 21 16:17:07 localhost kernel: [   19.322750] CR2: 0000000000000010
Jul 21 16:17:07 localhost kernel: [   19.330669] ---[ end trace b13215eb98a2df5f ]---

Comment 92 Chris Wilson 2013-07-21 11:05:03 UTC

Created attachment 82768 [details] [review]
New read-after-write patch

Oops, my mistake, please try again.

Comment 93 LongerDev 2013-07-21 11:38:42 UTC

Created attachment 82773 [details]
i915_error_state with new patch

(In reply to comment #92)
> Created attachment 82768 [details] [review]
> New read-after-write patch
> 
> Oops, my mistake, please try again.

It boots now, but after five minutes of testing:
Jul 21 17:32:56 localhost kernel: [  321.432882] hda-intel 0000:00:1b.0: Unstable LPIB (32740 >= 4096); disabling LPIB delay counting
Jul 21 17:34:49 localhost kernel: [  434.291085] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
Jul 21 17:34:49 localhost kernel: [  434.291088] [drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_state
Jul 21 17:34:49 localhost kernel: [  434.307124] [drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0xbfe2000 ctx 1) at 0xbfe21dc

Comment 94 Chris Wilson 2013-07-21 11:42:36 UTC

(In reply to comment #93)
> Created attachment 82773 [details]
> i915_error_state with new patch
> 
> (In reply to comment #92)
> > Created attachment 82768 [details] [review]
> > New read-after-write patch
> > 
> > Oops, my mistake, please try again.
> 
> Now loading, but after five minutes test:
> Jul 21 17:32:56 localhost kernel: [  321.432882] hda-intel
> 0000:00:1b.0: Unstable LPIB (32740 >= 4096); disabling LPIB delay counting
> Jul 21 17:34:49 localhost kernel: [  434.291085]
> [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
> Jul 21 17:34:49 localhost kernel: [  434.291088] [drm] capturing
> error event; look for more information in
> /sys/kernel/debug/dri/0/i915_error_state
> Jul 21 17:34:49 localhost kernel: [  434.307124]
> [drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0xbfe2000
> ctx 1) at 0xbfe21dc

That is a blorp (mesa/i965) bug and not the semaphore deadlock.

Comment 95 Chris Wilson 2013-08-11 11:53:14 UTC

Will someone please try https://bugs.freedesktop.org/attachment.cgi?id=82768 with a working mesa! :)

Comment 96 Andy Lutomirski 2013-08-24 01:49:55 UTC

The patch seems to have helped -- my box survived a couple days with the patch applied.

Comment 97 Chris Wilson 2013-08-25 13:25:37 UTC

The bad news is that I've just had the semaphore hang with all the read-after-write patches applied. :|

Comment 98 Janusz 2013-09-03 20:18:48 UTC

(In reply to comment #94)
> (In reply to comment #93)
> > Created attachment 82773 [details]
> > i915_error_state with new patch
> > 
> > (In reply to comment #92)
> > > Created attachment 82768 [details] [review]
> > > New read-after-write patch
> > > 
> > > Oops, my mistake, please try again.
> > 
> > Now loading, but after five minutes test:
> > Jul 21 17:32:56 localhost kernel: [  321.432882] hda-intel
> > 0000:00:1b.0: Unstable LPIB (32740 >= 4096); disabling LPIB delay counting
> > Jul 21 17:34:49 localhost kernel: [  434.291085]
> > [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
> > Jul 21 17:34:49 localhost kernel: [  434.291088] [drm] capturing
> > error event; look for more information in
> > /sys/kernel/debug/dri/0/i915_error_state
> > Jul 21 17:34:49 localhost kernel: [  434.307124]
> > [drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0xbfe2000
> > ctx 1) at 0xbfe21dc
> 
> That is a blorp (mesa/i965) bug and not the semaphore deadlock.

Could you please provide a link to this blorp bug report?
I had the semaphore deadlock problem. With kernel 3.11 it no longer seems to occur (even without the patch), but now I get:

[22221.843000] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
[22221.843483] [drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0x4dfb5000 ctx 1) at 0x4dfb5518

Comment 99 Chris Wilson 2013-09-04 00:27:20 UTC

*** Bug 68913 has been marked as a duplicate of this bug. ***

Comment 100 Dan Doel 2013-09-08 17:09:22 UTC

I have, I think, a reliable way to trigger this behavior, if that helps. It requires a non-trivial setup, though.

I have gnome-shell running on dual monitors. The first is 1920x1200, the second is 1920x1080 (not sure if the resolution difference matters). If I run a full-screen game on the 1920x1200 monitor, I get freezes, and messages in dmesg about hangcheck timers and ring kicks ("stuck wait on blitter ring").

I believe OpenGL acceleration of the desktop is important, because the freezes are not triggered in fluxbox, for instance. I'm not sure if the game itself needs to be using OpenGL, or if the full-screen window is the triggering factor, or something else entirely. It is important that the game keep the monitors distinct, and only go full screen on one. I just tried it on Battle for Wesnoth, and full screen there sets the monitors to mirror, which doesn't trigger the problem.

This is on an i7 4770, if that matters.

I realize this may be difficult to put together for a test setup, but I thought I'd mention it.

Comment 101 Janusz 2013-09-08 17:25:55 UTC

(In reply to comment #100)
> I have, I think, a reliable way to trigger this behavior, if that helps. It
> requires a non-trivial setup, though.
> 
> I have gnome-shell running on dual monitors. The first is 1920x1200, the
> second is 1920x1080 (not sure if the resolution difference matters). If I
> run a full-screen game on The 1920x1200 monitor, I get freezes, and notes in
> the dmesg about hangcheck timers and kickrings ("stuck wait on blitter
> ring").
> 
> I believe OpenGL acceleration of the desktop is important, because the
> freezes are not triggered in fluxbox, for instance. I'm not sure if the game
> itself needs to be using OpenGL, or if the full-screen window is the
> triggering factor, or something else entirely. It is important that the game
> keep the monitors distinct, and only go full screen on one. I just tried it
> on Battle for Wesnoth, and full screen there sets the monitors to mirror,
> which doesn't trigger the problem.
> 
> This is on an i7 4770, if that matters.
> 
> I realize this is may be difficult to put together for a test setup, but I
> thought I'd mention it.

I also have dual monitors and gnome-shell, but both of mine are 1920x1080. I notice that this happens more often when I watch videos in full screen on one monitor (it still happens during non-full-screen work).

Comment 102 Chris Wilson 2013-09-08 17:26:04 UTC

(In reply to comment #100) 
> This is on an i7 4770, if that matters.

No, that's something completely new. Please open a new bug report and attach your dmesg, Xorg.0.log and /sys/class/drm/card0/error from after one of the hangs.

Comment 103 yjcoshc 2013-10-04 09:09:24 UTC

Created attachment 87101 [details]
i915_error_state (kernel 3.11.3)

Comment 104 yjcoshc 2013-10-04 09:12:35 UTC

After playing hedgewars for about half an hour, the gpu started to hang.
dmesg output:
[ 3442.907459] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
[ 3442.907471] [drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_state
[ 3442.916792] [drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0x5e52000 ctx 1) at 0x5e52220
[ 3466.911077] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
[ 3466.911087] [drm:i915_hangcheck_elapsed] *ERROR* stuck on blitter ring
[ 3466.947069] [drm:__gen6_gt_force_wake_get] *ERROR* Timed out waiting for forcewake old ack to clear.
I'm not sure my problem is related to this bug.

Comment 105 yjcoshc 2013-10-04 09:13:57 UTC

(In reply to comment #104)
> After playing hedgewars for about half an hour, the gpu started to hang.
> dmesg output:
> [ 3442.907459] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
> [ 3442.907471] [drm] capturing error event; look for more information in
> /sys/kernel/debug/dri/0/i915_error_state
> [ 3442.916792] [drm:i915_set_reset_status] *ERROR* render ring hung inside
> bo (0x5e52000 ctx 1) at 0x5e52220
> [ 3466.911077] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
> [ 3466.911087] [drm:i915_hangcheck_elapsed] *ERROR* stuck on blitter ring
> [ 3466.947069] [drm:__gen6_gt_force_wake_get] *ERROR* Timed out waiting for
> forcewake old ack to clear.
> I'm not sure my problem is related to this bug.

My laptop is Thinkpad T420 with i5-2520M. The BIOS version is 1.44.

Comment 106 Daniel Vetter 2013-10-04 09:21:20 UTC

(In reply to comment #104)
> I'm not sure my problem is related to this bug.

Most likely it isn't - gpu hang is similar to an application crashing. Please file a new bug report and don't forget to attach the error state file. That's the first thing we need to triage the bug.

And of course list the versions of all the userspace driver parts (mesa, ddx, ...) since like a normal application crash most often it's not a kernel bug, but a bug in the render commands submitted by userspace to the gpu.

Comment 107 LongerDev 2013-10-04 09:28:37 UTC

(In reply to comment #106)
> (In reply to comment #104)
> > I'm not sure my problem is related to this bug.
> 
> Most likely it isn't - gpu hang is similar to an application crashing.
> Please file a new bug report and don't forget to attach the error state
> file. That's the first thing we need to triage the bug.
> 
> And of course list the versions of all the userspace driver parts (mesa,
> ddx, ...) since like a normal application crash most often it's not a kernel
> bug, but a bug in the render commands submitted by userspace to the gpu.

Why can userspace drivers break rendering and cause errors in the kernel part of the driver? Could a "filter" be added for the submitted commands, so that bad ones are ignored (or handled in some other way, but not executed)?

Comment 108 Chris Wilson 2013-10-04 09:35:29 UTC

(In reply to comment #107)
> (In reply to comment #106)
> > (In reply to comment #104)
> > > I'm not sure my problem is related to this bug.
> > 
> > Most likely it isn't - gpu hang is similar to an application crashing.
> > Please file a new bug report and don't forget to attach the error state
> > file. That's the first thing we need to triage the bug.
> > 
> > And of course list the versions of all the userspace driver parts (mesa,
> > ddx, ...) since like a normal application crash most often it's not a kernel
> > bug, but a bug in the render commands submitted by userspace to the gpu.
> 
> Why userspace drivers can breaking render and calling error in kernel part
> of driver? May be can add "filter" sent commands and ignore (or other
> reaction, but not execute their) their?

The GPU is a full Turing complete computational engine (in fact, lots of them coupled in parallel and in series), see http://xkcd.com/1266/

Comment 109 yjcoshc 2013-10-05 15:08:58 UTC

(In reply to comment #106)
> (In reply to comment #104)
> > I'm not sure my problem is related to this bug.
> 
> Most likely it isn't - gpu hang is similar to an application crashing.
> Please file a new bug report and don't forget to attach the error state
> file. That's the first thing we need to triage the bug.
> 
> And of course list the versions of all the userspace driver parts (mesa,
> ddx, ...) since like a normal application crash most often it's not a kernel
> bug, but a bug in the render commands submitted by userspace to the gpu.

Someone has reported it here.
https://bugs.freedesktop.org/show_bug.cgi?id=70151

Comment 110 Jan Jurko 2013-10-08 18:19:52 UTC

Hello. Same problem here.

[  485.443455] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
[  485.443467] [drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_state
[  485.452727] [drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0xa637000 ctx 1) at 0xa6371c8
[  821.726799] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
[  821.726873] [drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0x4974000 ctx 1) at 0x49741c8
[ 1311.134514] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
[ 1311.134613] [drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0x4a98000 ctx 1) at 0x4a98220

sys: fedora 19 64b
Linux jarvis 3.11.2-201.fc19.x86_64 #1 SMP Fri Sep 27 19:20:55 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

WM: KDE with effects enabled

8G ram
300G SATA HDD
notebook: Lenovo ThinkPad E320

problem occurs in:
- scrolling in firefox
- playing video in vlc and switch to KDE terminal or another app
- sometimes system hangs, cpu 100%, freeze and hard reboot needed
- sometimes happens if I work with ff or in terminal only (very frustrating)
- happening across many kernel versions 3.0 to newest I think


lspci
00:00.0 Host bridge: Intel Corporation 2nd Generation Core Processor Family DRAM Controller (rev 09)
00:02.0 VGA compatible controller: Intel Corporation 2nd Generation Core Processor Family Integrated Graphics Controller (rev 09)
00:16.0 Communication controller: Intel Corporation 6 Series/C200 Series Chipset Family MEI Controller #1 (rev 04)
00:1a.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #2 (rev 04)
00:1b.0 Audio device: Intel Corporation 6 Series/C200 Series Chipset Family High Definition Audio Controller (rev 04)
00:1c.0 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 1 (rev b4)
00:1c.1 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 2 (rev b4)
00:1c.2 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 3 (rev b4)
00:1c.5 PCI bridge: Intel Corporation 6 Series/C200 Series Chipset Family PCI Express Root Port 6 (rev b4)
00:1d.0 USB controller: Intel Corporation 6 Series/C200 Series Chipset Family USB Enhanced Host Controller #1 (rev 04)
00:1f.0 ISA bridge: Intel Corporation HM65 Express Chipset Family LPC Controller (rev 04)
00:1f.2 SATA controller: Intel Corporation 6 Series/C200 Series Chipset Family 6 port SATA AHCI Controller (rev 04)
00:1f.3 SMBus: Intel Corporation 6 Series/C200 Series Chipset Family SMBus Controller (rev 04)
02:00.0 Network controller: Intel Corporation Centrino Wireless-N 1000 [Condor Peak]
03:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS5209 PCI Express Card Reader (rev 01)
03:00.1 SD Host controller: Realtek Semiconductor Co., Ltd. RTS5209 PCI Express Card Reader (rev 01)
08:00.0 Ethernet controller: Qualcomm Atheros AR8151 v2.0 Gigabit Ethernet (rev c0)

Comment 111 Daniel Vetter 2013-10-08 19:45:39 UTC

(In reply to comment #110)
> Hello. Same problem here.
> 
> [  485.443455] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
> [  485.443467] [drm] capturing error event; look for more information in
> /sys/kernel/debug/dri/0/i915_error_state
> [  485.452727] [drm:i915_set_reset_status] *ERROR* render ring hung inside
> bo (0xa637000 ctx 1) at 0xa6371c8

Unlikely that this is the same gpu hang. Please file a new bug report and attach the error state.

Comment 112 Mathias Dietrich 2013-10-15 15:29:34 UTC

Just a few remarks.
I still see this bug with Kernel 3.8, Mesa 9.2.1 and DRI 2.99.904.
Moreover, after switching from Mesa 9.1.x to Mesa 9.2.x the number of lockups increased considerably (especially in games).
Additionally, with the latest drivers the complete system lockups are gone, but there is still a lockup lasting several seconds, followed by VT switching.
Maybe these observations help somehow.

Comment 113 Daniel Vetter 2013-10-16 08:20:09 UTC

(In reply to comment #112)
> Just a few remarks.
> I still see this bug with Kernel 3.8, Mesa 9.2.1 and DRI 2.99.904.
> Moreover, with switching from Mesa 9.1.x to Mesa 9.2.x the number of lockups
> highly increased (especially in games).

On snb the blorp engine in mesa has become a bit more hang-happy, see bug #70151
Not all gpu hangs are created equal ;-)

> Additionally with running the latest drivers complete system lockups are
> gone, but it's still a lockup for multiple seconds with following VT
> switching.

You mean a gpu hang happens while doing a vt switch?

Comment 114 Mathias Dietrich 2013-10-16 08:57:39 UTC

(In reply to comment #113)
> On snb the blorp engine in mesa has become a bit more hang-happy, see bug
> #70151
> Not all gpu hangs are created equal ;-)
> 

Actually it was on Sandybridge.

> You mean a gpu hang happens while when doing a vt switch?

No, I meant that if you suffer a lockup, you just have to wait a few seconds and switch to another VT and back; then you can resume using your system (although sometimes fonts are broken).

Comment 115 bay 2013-10-19 17:39:13 UTC

Created attachment 87857 [details]
i915_error_state

I also hit this bug while I was watching video in mplayer. It happens every 1-2 hours.

[40787.765816] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
[40787.765852] [drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_state
[40787.772361] [drm:i915_set_reset_status] *ERROR* render ring hung inside bo (0x1fb63000 ctx 1) at 0x1fb63220

Comment 116 bay 2013-10-19 17:45:15 UTC

Created attachment 87858 [details]
X -version output

Comment 117 Daniel Vetter 2013-10-19 18:22:57 UTC

(In reply to comment #115)
> Created attachment 87857 [details]
> i915_error_state
> 
> I also met this bug while I was watching video in mplayer. It every 1-2
> hours.
> 
> [40787.765816] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
> [40787.765852] [drm] capturing error event; look for more information in
> /sys/kernel/debug/dri/0/i915_error_state
> [40787.772361] [drm:i915_set_reset_status] *ERROR* render ring hung inside
> bo (0x1fb63000 ctx 1) at 0x1fb63220

This looks like bug #70151, but is definitely not this bug here.
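
For readers trying to tell the two kinds of report apart, here is a minimal sketch that sorts dmesg lines by the message fragments quoted in this thread. The category names and the `classify` helper are hypothetical, and the mapping is only an assumption drawn from the comments above: the developers triage from the i915_error_state file, so a match here is a hint and not a diagnosis.

```python
import re

# Message fragments copied from the dmesg excerpts quoted in this thread.
# The category labels are made up for illustration; they are not names used
# by the i915 driver or its developers.
SIGNATURES = [
    ("semaphore-kick", re.compile(r"Kicking stuck semaphore on (\w+) ring")),
    ("hung-inside-bo", re.compile(r"(\w+) ring hung inside bo")),
    ("stuck-ring", re.compile(r"stuck on (\w+) ring")),
]


def classify(line):
    """Return (category, ring) for a dmesg line, or None if nothing matches."""
    for category, pattern in SIGNATURES:
        match = pattern.search(line)
        if match:
            return category, match.group(1)
    return None
```

In this thread, lines of the "Kicking stuck semaphore" kind accompany the semaphore hang tracked here, while "hung inside bo" reports were redirected to new bugs such as bug #70151.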

Comment 118 yjcoshc 2013-11-16 12:15:47 UTC

Created attachment 89314 [details]
i915_error_state (kernel 3.11.6, mesa 9.2.2, xf86-video-intel 2.99.906)

GPU hangs after playing hedgewars for a few minutes. Thinkpad T420 laptop, i5-2520M.
dmesg error message:
[16901.286432] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
[16901.286441] [drm:i915_hangcheck_elapsed] *ERROR* stuck on blitter ring
[16901.286444] [drm] capturing error event; look for more information in /sys/kernel/debug/dri/0/i915_error_state
[16908.287504] [drm:i915_hangcheck_elapsed] *ERROR* stuck on render ring
[16908.287508] [drm:i915_hangcheck_elapsed] *ERROR* stuck on blitter ring

Comment 119 Kenneth Graunke 2013-11-21 22:17:59 UTC

*** Bug 71890 has been marked as a duplicate of this bug. ***

Comment 120 Chris Wilson 2013-11-26 19:49:47 UTC

*** Bug 72048 has been marked as a duplicate of this bug. ***

Comment 121 Chris Wilson 2013-12-18 10:36:06 UTC

*** Bug 72829 has been marked as a duplicate of this bug. ***

Comment 122 Chris Wilson 2014-01-15 12:01:12 UTC

*** Bug 73659 has been marked as a duplicate of this bug. ***

Comment 123 David 2014-01-24 03:58:26 UTC

Created attachment 92710 [details]
i915_error_state

I'm also getting regular Sandybridge GPU lockups with Mesa 10.0.1 and Linux kernel 3.13. 

dmesg output:

[  918.876872] [drm] stuck on render ring
[  918.876876] [drm] stuck on blitter ring
[  918.876878] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[  918.876879] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[  918.876879] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[  918.876880] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[  918.876880] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[  932.923240] [drm] stuck on render ring
[  932.923242] [drm] stuck on blitter ring

Unfortunately the crash dump doesn't help - it's an empty file!

Comment 124 Chris Wilson 2014-01-29 13:30:11 UTC

*** Bug 74180 has been marked as a duplicate of this bug. ***

Comment 125 Chris Wilson 2014-01-31 09:16:29 UTC

*** Bug 74265 has been marked as a duplicate of this bug. ***

Comment 126 Chris Wilson 2014-02-03 20:32:13 UTC

*** Bug 74452 has been marked as a duplicate of this bug. ***

Comment 127 Chris Wilson 2014-02-03 20:34:11 UTC

*** Bug 74473 has been marked as a duplicate of this bug. ***

Comment 128 Chris Wilson 2014-02-12 08:49:11 UTC

*** Bug 74867 has been marked as a duplicate of this bug. ***

Comment 129 Chris Wilson 2014-02-18 16:26:59 UTC

*** Bug 75163 has been marked as a duplicate of this bug. ***

Comment 130 Simon Farnsworth 2014-03-04 12:46:11 UTC

Created attachment 95090 [details]
Another version of the same hang - directed here from bug 75502

Comment 131 Chris Wilson 2014-03-10 21:05:47 UTC

*** Bug 75999 has been marked as a duplicate of this bug. ***

Comment 132 Chris Wilson 2014-03-20 22:05:09 UTC

*** Bug 76408 has been marked as a duplicate of this bug. ***

Comment 133 Chris Wilson 2014-03-27 10:14:56 UTC

*** Bug 76677 has been marked as a duplicate of this bug. ***

Comment 134 Chris Wilson 2014-03-30 17:49:12 UTC

*** Bug 76801 has been marked as a duplicate of this bug. ***

Comment 135 Phil Turmel 2014-03-30 18:07:36 UTC

For what it's worth, running 3.13.7 greatly mitigates this bug, to where the dead time is barely noticeable.  It happened three times in short order here and I didn't notice any of them:

[ 4562.551141] [drm:ring_stuck] *ERROR* Kicking stuck semaphore on render ring
[ 4582.530028] [drm:ring_stuck] *ERROR* Kicking stuck semaphore on render ring
[ 4633.476199] [drm:ring_stuck] *ERROR* Kicking stuck semaphore on render ring
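
To put a number on "in short order", a small sketch like the following could compute the gaps between consecutive kicks from the bracketed dmesg timestamps. The `kick_intervals` helper is hypothetical and assumes the plain `[seconds.micros]` prefix shown above, with no syslog date prefix in front of it.

```python
import re

# Matches the leading "[ 4562.551141]" timestamp of a raw dmesg line.
TIMESTAMP = re.compile(r"^\[\s*(\d+\.\d+)\]")


def kick_intervals(lines):
    """Return the seconds elapsed between consecutive semaphore kicks."""
    times = []
    for line in lines:
        match = TIMESTAMP.match(line)
        if match and "Kicking stuck semaphore" in line:
            times.append(float(match.group(1)))
    # Pair each kick with the next one and take the difference.
    return [later - earlier for earlier, later in zip(times, times[1:])]
```

Applied to the three lines above, this gives gaps of roughly 20 and 51 seconds.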

Comment 136 Chris Wilson 2014-04-04 08:11:42 UTC

*** Bug 77043 has been marked as a duplicate of this bug. ***

Comment 137 Chris Wilson 2014-04-04 15:50:51 UTC

*** Bug 77058 has been marked as a duplicate of this bug. ***

Comment 138 Phil Turmel 2014-04-05 16:41:41 UTC

My stuck ring faults are completely gone with i915.i915_enable_rc6=0.  The fan staying on a bit more (subjectively) seems to be the only side effect.  HP Pavilion dv6 (Sandybridge).
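
For anyone repeating this test, a sketch along these lines could confirm that the option actually reached the kernel command line. The `rc6_setting` helper is hypothetical; the parameter name is taken from the comment above and other kernel versions may spell it differently. It also assumes the option was passed at boot: a value set through a modprobe configuration file would not appear in /proc/cmdline.

```python
def rc6_setting(cmdline, name="i915.i915_enable_rc6"):
    """Return the value given for `name` on a kernel command line, or None.

    If the option appears more than once, the last occurrence is returned.
    """
    value = None
    for token in cmdline.split():
        key, sep, val = token.partition("=")
        if sep and key == name:
            value = val
    return value
```

On a running system the input would come from `open("/proc/cmdline").read()`; a return value of `"0"` means the option from the comment above is in effect on the command line.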

Comment 139 Chris Wilson 2014-04-05 20:26:43 UTC

Oh that's interesting. We might be able to find a register to prevent rc6 whilst waiting on a semaphore. (Hmm, too bad it isn't ivb or we could just frob forcewake directly.)

Comment 140 Phil Turmel 2014-04-06 01:05:06 UTC

(In reply to comment #139)
> Oh that's interesting. We might be able to find a register to prevent rc6
> whilst waiting on a semaphore. (Hmm, too bad it isn't ivb or we could just
> frob forcewake directly.)

Happy to test patches.  I'm updating to 3.13.9 tonight.  I could add something on top if you have ideas.  If you need more info than my attachment to #76801 just let me know.

Comment 141 Chris Wilson 2014-04-07 17:24:37 UTC

*** Bug 77147 has been marked as a duplicate of this bug. ***

Comment 142 Chris Wilson 2014-04-30 10:40:54 UTC

*** Bug 77974 has been marked as a duplicate of this bug. ***

Comment 143 Chris Wilson 2014-05-06 05:29:30 UTC

*** Bug 78317 has been marked as a duplicate of this bug. ***

Comment 144 Artjom Simon 2014-05-06 21:52:53 UTC

Created attachment 98589 [details]
Kernel 3.14.2-1-ARCH, xf86-video-intel 2.99.911-2, mesa 10.1.2-1

Comment 145 Chris Wilson 2014-05-16 14:42:30 UTC

*** Bug 78785 has been marked as a duplicate of this bug. ***

Comment 146 Chris Wilson 2014-06-01 11:55:25 UTC

*** Bug 79500 has been marked as a duplicate of this bug. ***

Comment 147 Chris Wilson 2014-06-04 16:03:30 UTC

*** Bug 79640 has been marked as a duplicate of this bug. ***

Comment 148 Jani Nikula 2014-06-10 16:37:49 UTC

commit ca79d888eb63cdacf80653ae23ce8f7d9ac52c68
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jun 6 10:22:29 2014 +0100

    drm/i915: Reorder semaphore deadlock check

Comment 149 Chris Wilson 2014-06-15 16:18:55 UTC

*** Bug 80055 has been marked as a duplicate of this bug. ***

Comment 150 Chris Wilson 2014-06-17 08:54:50 UTC

*** Bug 80125 has been marked as a duplicate of this bug. ***

Comment 151 Chris Wilson 2014-06-17 22:31:15 UTC

*** Bug 80168 has been marked as a duplicate of this bug. ***

Comment 152 Chris Wilson 2014-06-23 13:56:49 UTC

*** Bug 80401 has been marked as a duplicate of this bug. ***

Comment 153 Chris Wilson 2014-06-27 11:40:37 UTC

*** Bug 80592 has been marked as a duplicate of this bug. ***

Comment 154 Chris Wilson 2014-07-05 06:09:34 UTC

*** Bug 80935 has been marked as a duplicate of this bug. ***

Comment 155 Chris Wilson 2014-07-08 19:39:09 UTC

*** Bug 81064 has been marked as a duplicate of this bug. ***

Comment 156 Kurt Roeckx 2014-07-08 19:44:41 UTC

Can someone indicate what the current status of this is?

Comment 157 Yun-Fong Loh 2014-07-08 21:06:51 UTC

I haven't seen it with xorg-x11-drv-intel-2.99.912-4 (built for fc20) from kojipkgs.

Comment 158 Kurt Roeckx 2014-07-08 21:28:12 UTC

I'm using 2.21.15 which as far as I know is the latest release.

Comment 159 Andre Robatino 2014-07-11 19:28:58 UTC

I am seeing

[drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... render ring idle

followed by a graphics freeze and the need to reboot (if I can) in Fedora 20 with the latest updates including the 3.15.4 kernel.

Comment 160 Chris Wilson 2014-07-16 06:11:25 UTC

*** Bug 81402 has been marked as a duplicate of this bug. ***

Comment 161 Matteo Croce 2014-07-16 16:10:07 UTC

The same happens with 3.15.0 on Ubuntu 14.04 64-bit:

Jul 11 12:43:41 localhost kernel: [42049.462542] [drm] stuck on render ring
Jul 11 12:43:41 localhost kernel: [42049.463330] [drm] GPU HANG: ecode 0:0x00ffffff, in chrome [2172], reason: Ring hung, action: reset
Jul 11 12:43:41 localhost kernel: [42049.463334] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
Jul 11 12:43:41 localhost kernel: [42049.463335] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
Jul 11 12:43:41 localhost kernel: [42049.463336] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
Jul 11 12:43:41 localhost kernel: [42049.463337] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
Jul 11 12:43:41 localhost kernel: [42049.463338] [drm] GPU crash dump saved to /sys/class/drm/card0/error
Jul 11 12:43:43 localhost kernel: [42051.464623] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
Jul 11 12:43:47 localhost kernel: [42055.468816] [drm] stuck on render ring
Jul 11 12:43:47 localhost kernel: [42055.469614] [drm] GPU HANG: ecode 0:0x00ffffff, in chrome [2172], reason: Ring hung, action: reset
Jul 11 12:43:49 localhost kernel: [42057.470899] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
Jul 11 12:43:53 localhost kernel: [42061.439056] [drm] stuck on render ring
Jul 11 12:43:53 localhost kernel: [42061.439867] [drm] GPU HANG: ecode 0:0xfeffffff, in chrome [2172], reason: Ring hung, action: reset
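
The "GPU HANG" lines in logs like the one above follow a regular layout, so they can be pulled apart mechanically when comparing reports. The following is only a minimal sketch, assuming the message format stays exactly as pasted in this comment; the names `HANG_RE`, `parse_hangs` and `count_by_ecode` are made up for illustration and are not part of any i915 tooling.

```python
import re
from collections import Counter

# Matches the hang summary line as it appears in the logs in this report, e.g.
#   [drm] GPU HANG: ecode 0:0x00ffffff, in chrome [2172], reason: Ring hung, action: reset
# Assumption: the kernel keeps this exact wording; other kernels may differ.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[^,\s]+), in (?P<comm>.+?) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>[^,]+), action: (?P<action>\w+)"
)


def parse_hangs(log_text):
    """Return one dict per 'GPU HANG' line found in log_text."""
    hangs = []
    for line in log_text.splitlines():
        match = HANG_RE.search(line)
        if match:
            entry = match.groupdict()
            entry["pid"] = int(entry["pid"])
            hangs.append(entry)
    return hangs


def count_by_ecode(hangs):
    """Count how often each ecode string occurs in the parsed hangs."""
    return Counter(hang["ecode"] for hang in hangs)
```

Feeding the output of `dmesg` or `journalctl -k` into `parse_hangs` gives a quick overview of which process and ecode each hang was attributed to. It does not replace attaching /sys/class/drm/card0/error.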

Comment 162 Christopher Wawak 2014-07-17 16:17:10 UTC

[872948.822279] [drm] stuck on render ring
[872948.822291] [drm] stuck on blitter ring
[872948.823041] [drm] GPU HANG: ecode 0:0xf4e9fffe, in Xorg [30647], reason: Ring hung, action: reset
[872948.823045] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[872948.823046] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[872948.823047] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[872948.823048] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[872948.823049] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[872948.823168] [drm:i915_context_is_banned] *ERROR* gpu hanging too fast, banning!
[872950.821912] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off

Linux bobloblaw 3.15.0-1.fc20.x86_64 #1 SMP Sat Jun 14 11:22:00 EDT 2014 x86_64 x86_64 x86_64 GNU/Linux

Attaching gpu crash dump as card0-error.071714-cwawak

Comment 163 Christopher Wawak 2014-07-17 16:18:43 UTC

Created attachment 102991 [details]
card0-error.071714-cwawak - gpu dump

Comment 164 Chris Wilson 2014-07-23 13:56:36 UTC

*** Bug 81673 has been marked as a duplicate of this bug. ***

Comment 165 Chris Wilson 2014-07-23 14:54:59 UTC

*** Bug 81676 has been marked as a duplicate of this bug. ***

Comment 166 Chris Wilson 2014-07-24 13:45:27 UTC

*** Bug 81710 has been marked as a duplicate of this bug. ***

Comment 167 Chris Wilson 2014-07-28 19:14:03 UTC

*** Bug 81844 has been marked as a duplicate of this bug. ***

Comment 168 Chris Wilson 2014-08-01 06:44:21 UTC

*** Bug 81990 has been marked as a duplicate of this bug. ***

Comment 169 Chris Wilson 2014-08-07 05:43:52 UTC

*** Bug 82277 has been marked as a duplicate of this bug. ***

Comment 170 Chris Wilson 2014-08-07 15:49:15 UTC

*** Bug 82301 has been marked as a duplicate of this bug. ***

Comment 171 Chris Wilson 2014-08-10 05:06:56 UTC

*** Bug 82399 has been marked as a duplicate of this bug. ***

Comment 172 Chris Wilson 2014-08-11 08:25:15 UTC

*** Bug 82451 has been marked as a duplicate of this bug. ***

Comment 173 Michael Stahl 2014-08-14 19:23:59 UTC

*** Bug 82620 has been marked as a duplicate of this bug. ***

Comment 174 Chris Wilson 2014-08-15 06:17:19 UTC

*** Bug 82631 has been marked as a duplicate of this bug. ***

Comment 175 Chris Wilson 2014-08-15 20:44:07 UTC

*** Bug 82666 has been marked as a duplicate of this bug. ***

Comment 176 Chris Wilson 2014-08-16 07:01:31 UTC

*** Bug 82691 has been marked as a duplicate of this bug. ***

Comment 177 Chris Wilson 2014-08-21 09:28:46 UTC

*** Bug 82901 has been marked as a duplicate of this bug. ***

Comment 178 Chris Wilson 2014-08-26 11:13:12 UTC

*** Bug 83098 has been marked as a duplicate of this bug. ***

Comment 179 Chris Wilson 2014-08-27 17:51:12 UTC

*** Bug 83156 has been marked as a duplicate of this bug. ***

Comment 180 Chris Wilson 2014-09-01 06:01:47 UTC

*** Bug 83326 has been marked as a duplicate of this bug. ***

Comment 181 Chris Wilson 2014-09-04 05:48:20 UTC

*** Bug 83473 has been marked as a duplicate of this bug. ***

Comment 182 Chris Wilson 2014-09-09 10:55:20 UTC

*** Bug 83661 has been marked as a duplicate of this bug. ***

Comment 183 Manuel Widmer 2014-09-09 20:27:35 UTC

Is there any ongoing development to fix this bug? I still see it with
Linux <hostname> 3.13.0-35-generic #62-Ubuntu SMP Fri Aug 15 01:58:42 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux

and the latest Intel drivers as provided by the Intel Linux Graphics Installer from
https://01.org/linuxgraphics/

My system often freezes a few minutes after I start watching a movie with VLC. My screen is connected to the Linux system through a receiver (HDMI for audio + video). A freeze is more likely when the HDMI receiver has been powered off for some time before playing the movie than when I reboot and HDMI stays on.

I'm happy to help with crashdumps as far as I'm able to collect them.

Comment 184 Bartosz Brachaczek 2014-09-09 20:49:34 UTC

(In reply to comment #183)

I recommend configuring i915.semaphores=0. I did it and it doesn't freeze anymore.
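
To confirm that the setting actually took effect after a reboot, the current value can be read back from sysfs. This is a minimal sketch, assuming the running kernel exposes the option as /sys/module/i915/parameters/semaphores (whether that file exists depends on the kernel version); the helper name `read_module_param` is hypothetical.

```python
from pathlib import Path


def read_module_param(name, module="i915", sysfs_root="/sys/module"):
    """Return the current value of a module parameter as a string.

    Returns None when the parameter file is missing or unreadable, which is
    what happens if the running kernel does not know the option at all.
    """
    path = Path(sysfs_root) / module / "parameters" / name
    try:
        return path.read_text().strip()
    except OSError:
        return None


if __name__ == "__main__":
    value = read_module_param("semaphores")
    if value is None:
        print("i915.semaphores is not exposed by this kernel")
    else:
        print("i915.semaphores =", value)
```

The `sysfs_root` argument only exists so the function can be pointed at a scratch directory for testing.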

Comment 185 Chris Wilson 2014-09-10 16:01:30 UTC

*** Bug 83721 has been marked as a duplicate of this bug. ***

Comment 186 Chris Wilson 2014-09-12 05:49:34 UTC

*** Bug 83783 has been marked as a duplicate of this bug. ***

Comment 187 f.st 2014-09-13 09:47:37 UTC

Hi Chris,

My current kernel is now 3.16.1-46.1.g90bc0f1.
After a reinstall, I'm surprised that the semaphore bug hasn't occurred yet; previously it showed up right after a fresh install.

This leaves me with 4 possible reasons:

1. The named kernel revision somehow contains a fix for it. Looking at the changes, I couldn't confirm that assumption.
2. cgroup_memory=disabled is related to it. (That's why I removed it for now.)
3. The BIOS settings (which could be different now) might have something to do with it.
4. I haven't installed KVM support yet.


I'll post again if I find a reproducible explanation.
Frank

Comment 188 f.st 2014-09-13 10:38:09 UTC

2. of course I meant cgroup_disable=memory

Comment 189 f.st 2014-09-13 15:56:05 UTC

Hi Chris,

OK, nothing of the above was the reason. In my case it's simply this:

/etc/X11/xorg.conf.d/20-intel.conf

Section "Device"
   Identifier  "Intel Graphics"
   Driver      "intel"
   Option      "TearFree"    "true"
EndSection


I added it when the tearing while scrolling through large webpages annoyed me.
As soon as I added it, the problems quickly started.

Self-made problem.

Frank
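
For anyone wanting to check whether their own configuration enables the same option, here is a rough sketch that scans xorg.conf-style text for TearFree. It is deliberately simplified and rests on assumptions: it ignores which Section the option sits in, only handles the quoted `Option "Name" "value"` form, and the function name `tearfree_enabled` is invented for this example.

```python
import re

# Option "Name" "value" (the value is optional for boolean options).
OPTION_RE = re.compile(r'^\s*Option\s+"([^"]+)"(?:\s+"([^"]*)")?', re.IGNORECASE)

# Values xorg.conf treats as boolean true; an option without a value is also true.
TRUE_VALUES = {"1", "on", "true", "yes"}


def tearfree_enabled(conf_text):
    """Return True if the last TearFree option in conf_text is switched on."""
    enabled = False
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = OPTION_RE.match(line)
        if match and match.group(1).lower() == "tearfree":
            value = match.group(2)
            enabled = value is None or value.lower() in TRUE_VALUES
    return enabled
```

Running it over each file in /etc/X11/xorg.conf.d/ would show whether the setting is active there; it says nothing about driver defaults.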

Comment 190 Chris Wilson 2014-09-13 16:26:37 UTC

(In reply to comment #189)
> Hi Chris,
> 
> OK, nothing of the above was the reason. In my case it's simply this:
> 
> /etc/X11/xorg.conf.d/20-intel.conf
> 
> Section "Device"
>    Identifier  "Intel Graphics"
>    Driver      "intel"
>    Option      "TearFree"    "true"
> EndSection
> 
> 
> I added it when the tearing scrolling through large webpages annoyed me.
> As soon as I added it, the problems quickly started.
> 
> Selfmade problem.

Not really, https://bugs.freedesktop.org/show_bug.cgi?id=70764 tracks that this hang is more likely with TearFree (fundamentally the hang is still the same hardware issue, but it is interesting that TearFree has a higher chance of hitting it).

If you want to experiment:

 http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=requests

should have an interesting fix, at least for trying to prevent TearFree from leading to the semaphore hang.

Comment 191 arrowsmith 2014-09-15 23:20:15 UTC

What information is most useful for these repeating issues, as it just happened again:

 Sep 16 08:32:59 arrowsmithlap1 kernel: [1182242.139690] [drm] stuck on render ring
 Sep 16 08:32:59 arrowsmithlap1 kernel: [1182242.139699] [drm] stuck on blitter ring
 Sep 16 08:32:59 arrowsmithlap1 kernel: [1182242.140239] [drm] GPU HANG: ecode 0:0xf4e9fffe, in Xorg [26353], reason: Ring hung, action: reset
 Sep 16 08:32:59 arrowsmithlap1 kernel: [1182242.140750] [drm:i915_context_is_banned] *ERROR* gpu hanging too fast, banning!
 Sep 16 08:32:59 arrowsmithlap1 kernel: [drm] stuck on render ring
 Sep 16 08:32:59 arrowsmithlap1 kernel: [drm] stuck on blitter ring
 Sep 16 08:32:59 arrowsmithlap1 kernel: [drm] GPU HANG: ecode 0:0xf4e9fffe, in Xorg [26353], reason: Ring hung, action: reset
 Sep 16 08:32:59 arrowsmithlap1 kernel: [drm:i915_context_is_banned] *ERROR* gpu hanging too fast, banning!
 Sep 16 08:33:01 arrowsmithlap1 kernel: [1182244.142445] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
 Sep 16 08:33:01 arrowsmithlap1 kernel: [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off

The only thing under my /etc/X11/xorg.conf.d/ is 00-keyboard.conf (system generated).

Do you want a copy of /sys/class/drm/card0/error every time?

Comment 192 Chris Wilson 2014-09-16 06:10:00 UTC

(In reply to comment #191)
> What information is most useful for these repeating issues, as it just
> happened again:
> 
>  Sep 16 08:32:59 arrowsmithlap1 kernel: [1182242.139690] [drm] stuck on
> render ring
>  Sep 16 08:32:59 arrowsmithlap1 kernel: [1182242.139699] [drm] stuck on
> blitter ring

So long as it is the same event, there is no more information we need other than testing feedback for an eventual workaround.

Comment 193 Manuel Widmer 2014-09-22 20:11:56 UTC

(In reply to comment #184)
> (In reply to comment #183)
> 
> I recommend configuring i915.semaphores=0. I did it and it doesn't freeze
> anymore.

Meanwhile I tested both i915.semaphores=0 and i915.semaphores=1, and neither helped in my case. With i915.semaphores=0 my system became much more unstable and even crashed on its own after some days without stress on graphics (I just ran some desktop apps like Thunar, or VLC for music only, no movies). With i915.semaphores=1 the system is at least stable (for some weeks) as long as I don't heavily use desktop applications.

Comment 194 Mika Kuoppala 2014-10-20 08:27:12 UTC

*** Bug 85194 has been marked as a duplicate of this bug. ***

Comment 195 Chris Wilson 2014-10-22 16:57:55 UTC

*** Bug 85333 has been marked as a duplicate of this bug. ***

Comment 196 Chris Wilson 2014-10-29 21:11:34 UTC

*** Bug 85609 has been marked as a duplicate of this bug. ***

Comment 197 Josh Glover 2014-11-03 11:47:51 UTC

I am also experiencing this, on a Gentoo system running on a ThinkPad T440s. I'm not doing anything related to XBMC, simply using xrandr for multihead. The interesting thing is that DRI works fine on my laptop screen (glxgears reports 60fps, which is the refresh rate of my screen), but breaks when I move a window trying to use DRI (e.g. Chrome, glxgears) to the external monitor connected to the mini Display Port output.

I see this stuff in dmesg:

[ 3561.424762] [drm:ring_stuck] *ERROR* Kicking stuck wait on blitter ring
[ 3561.424770] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[ 3561.424772] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[ 3561.424774] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[ 3561.424776] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[ 3561.424778] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[ 3566.422957] [drm:ring_stuck] *ERROR* Kicking stuck wait on blitter ring
[ 3571.425143] [drm:ring_stuck] *ERROR* Kicking stuck wait on blitter ring
[ 3575.423680] [drm:ring_stuck] *ERROR* Kicking stuck wait on blitter ring

Seems like the same issue. I'm trying to downgrade X, mesa, et al., to try and get the system back in working order.

Comment 198 Daniel Vetter 2014-11-04 14:28:28 UTC

*** Bug 79675 has been marked as a duplicate of this bug. ***

Comment 199 Chris Wilson 2014-11-06 16:13:14 UTC

*** Bug 85972 has been marked as a duplicate of this bug. ***

Comment 200 Chris Wilson 2014-11-09 16:41:02 UTC

*** Bug 86058 has been marked as a duplicate of this bug. ***

Comment 201 Peter Frühberger 2014-11-11 11:51:26 UTC

For those running Ubuntu, here is a build of a kernel based on 3.17.1 with the patches Chris Wilson wants you to test:

- Those patches have other regressions (so be careful to only test your specific issue).

https://dl.dropboxusercontent.com/u/55728161/linux-headers-3.17.1simonickle_3.17.1simonickle-10.00.Custom_amd64.deb
https://dl.dropboxusercontent.com/u/55728161/linux-image-3.17.1simonickle_3.17.1simonickle-10.00.Custom_amd64.deb

Those kernels are based on: https://bugs.freedesktop.org/show_bug.cgi?id=83677#c35

Beware, don't switch VTs.

Comment 202 Tomas Huryn 2014-11-17 18:41:40 UTC

I've tried the mentioned kernel on my Fedora 21 Beta and it still hangs, for example after NetBeans opens its main window across the whole screen.

Comment 203 Smruti 2014-11-19 03:51:44 UTC

*** Bug 86437 has been marked as a duplicate of this bug. ***

Comment 204 Chris Wilson 2014-11-27 07:36:13 UTC

*** Bug 86765 has been marked as a duplicate of this bug. ***

Comment 205 Chris Wilson 2014-11-29 10:50:09 UTC

*** Bug 86836 has been marked as a duplicate of this bug. ***

Comment 206 Chris Wilson 2014-12-02 09:28:08 UTC

*** Bug 86925 has been marked as a duplicate of this bug. ***

Comment 207 Chris Wilson 2014-12-25 20:36:14 UTC

*** Bug 87710 has been marked as a duplicate of this bug. ***

Comment 208 Chris Wilson 2014-12-28 09:35:52 UTC

*** Bug 87776 has been marked as a duplicate of this bug. ***

Comment 209 Chris Wilson 2015-01-17 21:14:25 UTC

*** Bug 88541 has been marked as a duplicate of this bug. ***

Comment 210 samuel.rakitnican 2015-01-20 12:26:49 UTC

*** Bug 88626 has been marked as a duplicate of this bug. ***

Comment 211 Chris Wilson 2015-01-22 21:40:51 UTC

*** Bug 88723 has been marked as a duplicate of this bug. ***

Comment 212 Chris Wilson 2015-01-26 08:56:30 UTC

*** Bug 88789 has been marked as a duplicate of this bug. ***

Comment 213 Chris Wilson 2015-02-11 10:10:03 UTC

*** Bug 89078 has been marked as a duplicate of this bug. ***

Comment 214 Chris Wilson 2015-02-24 14:37:40 UTC

*** Bug 89299 has been marked as a duplicate of this bug. ***

Comment 215 Chris Wilson 2015-03-13 11:50:39 UTC

*** Bug 89570 has been marked as a duplicate of this bug. ***

Comment 216 Chris Wilson 2015-03-19 10:08:41 UTC

*** Bug 89671 has been marked as a duplicate of this bug. ***

Comment 217 Chris Wilson 2015-03-26 08:19:23 UTC

*** Bug 89774 has been marked as a duplicate of this bug. ***

Comment 218 Chris Wilson 2015-03-26 08:19:50 UTC

*** Bug 89771 has been marked as a duplicate of this bug. ***

Comment 219 Chris Wilson 2015-04-11 13:53:49 UTC

*** Bug 89981 has been marked as a duplicate of this bug. ***

Comment 220 Chris Wilson 2015-04-20 07:13:20 UTC

*** Bug 90106 has been marked as a duplicate of this bug. ***

Comment 221 Chris Wilson 2015-04-22 21:57:45 UTC

*** Bug 90146 has been marked as a duplicate of this bug. ***

Comment 222 Chris Wilson 2015-05-03 10:42:37 UTC

*** Bug 90271 has been marked as a duplicate of this bug. ***

Comment 223 Chris Wilson 2015-05-15 20:46:46 UTC

*** Bug 90473 has been marked as a duplicate of this bug. ***

Comment 224 Chris Wilson 2015-06-04 05:46:09 UTC

*** Bug 90835 has been marked as a duplicate of this bug. ***

Comment 225 Martin Steigerwald 2015-06-04 09:10:41 UTC

Chris, you referred me to this bug as I reported

Bug 90835 - [4.1-rc6] gpu hang: ecode 6:-1:0x00000000, Kicking stuck semaphore on render ring

I skimmed through it and it appears that there are some patches to test? But I am not sure which ones these are. Can you or someone else enlighten me?

Also I note that I still use

        Option          "AccelMethod"   "uxa"

and I have

martin@merkaba:~> cat /etc/modprobe.d/i915-kms.conf 
options i915 modeset=1 i915_enable_rc6=7

thus maximum energy saving. But according to powertop it never enters the highest sleep state anyway.

I will remove the AccelMethod setting now and see whether it helps. If not, I will downgrade to 4.1-rc4 for now, as issues have been at least much less frequent with it.

And for me 4.1-rc6 really makes things much *worse*. I am typing this after a clean reboot and already got the GPU hang again. It happens every few minutes. Are you really sure this is the same GPU hang? I didn't have this before the 4.1 kernel.

Comment 226 Chris Wilson 2015-06-04 09:24:35 UTC

(In reply to Martin Steigerwald from comment #225)
> Chris, you referred me to this bug as I reported
> 
> Bug 90835 - [4.1-rc6] gpu hang: ecode 6:-1:0x00000000, Kicking stuck
> semaphore on render ring
> 
> I skimmed through it and it appears that there are some patches to test? But
> I am not sure which ones these are. Can you or someone else enlighten me?

There's likely a modest improvement in 4.2.

> Also I note that I still use
> 
>         Option          "AccelMethod"   "uxa"
> 
> and I have
> 
> martin@merkaba:~> cat /etc/modprobe.d/i915-kms.conf 
> options i915 modeset=1 i915_enable_rc6=7

Fortuitously that dangerous option doesn't do anything for your kernel.

> thus maximum energy saving. But according to powertop it never enters the
> highest sleep state anyway.
> 
> I will remove the AccelMethod setting now and see whether it helps. If not,
> I downgrade to 4.1-rc4 for now, as issues have been at least much less
> frequent with it.

Purely circumstantial.

> And its really that for me 4.1-rc6 makes things much *worse*. I am typing
> this after a clean reboot and already got the GPU hang again. It happens
> about every few minutes. Are you really sure this is the same GPU hang? I
> didn´t have this before 4.1 kernel?

Yes.

Comment 227 Martin Steigerwald 2015-06-04 13:51:06 UTC

(In reply to Chris Wilson from comment #226)
> (In reply to Martin Steigerwald from comment #225)
> > Chris, you referred me to this bug as I reported
> > 
> > Bug 90835 - [4.1-rc6] gpu hang: ecode 6:-1:0x00000000, Kicking stuck
> > semaphore on render ring
> > 
> > I skimmed through it and it appears that there are some patches to test? But
> > I am not sure which ones these are. Can you or someone else enlighten me?
> 
> There's likely a modest improvement in 4.2.

Nice.

> > Also I note that I still use
> > 
> >         Option          "AccelMethod"   "uxa"
> > 
> > and I have
> > 
> > martin@merkaba:~> cat /etc/modprobe.d/i915-kms.conf 
> > options i915 modeset=1 i915_enable_rc6=7
> 
> Fortuitously that dangerous option doesn't do anything for your kernel.

Well, I found out why: it seems I compiled i915 into the kernel; at least I don't have an i915 module in lsmod. But i915.i915_enable_rc6=7 on the kernel command line does not seem to have any effect either. I removed the option.
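
One way to see why an option has no effect is to compare the names on the `options` line against what the running kernel exposes under /sys/module/i915/parameters. The sketch below assumes that directory lists the parameters the kernel accepts (as far as I know it is also present when the driver is built in, but treat that as an assumption); the helper names `parse_options_line` and `unknown_options` are hypothetical.

```python
import os


def parse_options_line(line):
    """Parse 'options <module> key=value ...' into (module, {key: value})."""
    parts = line.split()
    if len(parts) < 2 or parts[0] != "options":
        raise ValueError("not a modprobe 'options' line")
    module = parts[1]
    opts = {}
    for token in parts[2:]:
        key, _, value = token.partition("=")
        opts[key] = value
    return module, opts


def unknown_options(line, sysfs_root="/sys/module"):
    """Return the option names on the line that the kernel does not expose."""
    module, opts = parse_options_line(line)
    param_dir = os.path.join(sysfs_root, module, "parameters")
    try:
        known = set(os.listdir(param_dir))
    except OSError:
        # Module not present at all: every option is unknown.
        known = set()
    return sorted(name for name in opts if name not in known)
```

Calling `unknown_options("options i915 modeset=1 i915_enable_rc6=7")` on the affected machine would list any names the kernel does not recognise.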

> > thus maximum energy saving. But according to powertop it never enters the
> > highest sleep state anyway.
> > 
> > I will remove the AccelMethod setting now and see whether it helps. If not,
> > I downgrade to 4.1-rc4 for now, as issues have been at least much less
> > frequent with it.
> 
> Purely circumstantial.

Since switching to SNA I haven't seen a GPU hang so far. It is too early to say for sure, but it seems something in UXA may have triggered it more easily.

Comment 228 Chris Wilson 2015-07-03 15:31:01 UTC

*** Bug 91212 has been marked as a duplicate of this bug. ***

Comment 229 Chris Wilson 2015-08-17 09:40:43 UTC

*** Bug 91662 has been marked as a duplicate of this bug. ***

Comment 230 Chris Wilson 2015-08-30 13:26:46 UTC

*** Bug 91810 has been marked as a duplicate of this bug. ***

Comment 231 Chris Wilson 2015-09-02 09:46:33 UTC

*** Bug 91832 has been marked as a duplicate of this bug. ***

Comment 232 samuel.rakitnican 2015-09-25 08:47:18 UTC

(In reply to Chris Wilson from comment #192)
> (In reply to comment #191)
> > What information is most useful for these repeating issues, as it just
> > happened again:
> > 
> >  Sep 16 08:32:59 arrowsmithlap1 kernel: [1182242.139690] [drm] stuck on
> > render ring
> >  Sep 16 08:32:59 arrowsmithlap1 kernel: [1182242.139699] [drm] stuck on
> > blitter ring
> 
> So long as it is the same event, there is no more information we need other
> than testing feedback for an eventual workaround.

Is this the same bug?

$ journalctl -p 3 -b -1
Ruj 25 02:13:01 crnigrom kernel: [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
Ruj 25 02:13:01 crnigrom kernel: [drm:__gen6_gt_wait_for_thread_c0.isra.16 [i915]] *ERROR* GT thread status wait timed out
... [ repeated messages ] ...
Ruj 25 02:13:33 crnigrom kernel: [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
Ruj 25 02:13:33 crnigrom kernel: [drm:__gen6_gt_wait_for_thread_c0.isra.16 [i915]] *ERROR* GT thread status wait timed out
Ruj 25 02:13:34 crnigrom kernel: [drm:stop_ring [i915]] *ERROR* render ring : timed out trying to stop ring
Ruj 25 02:13:34 crnigrom kernel: [drm:init_ring_common [i915]] *ERROR* render ring initialization failed ctl 00000000 (valid? 0) head 00000000 tail 00000000 start 00000000 [expected 00000000]
Ruj 25 02:13:34 crnigrom kernel: [drm:i915_reset [i915]] *ERROR* Failed hw init on reset -5
Ruj 25 02:13:34 crnigrom gnome-session[1823]: Unrecoverable failure in required component gnome-shell.desktop

After this, GNOME crashes with an "Oh No Something Is Wrong" screen.

$ uname -r
4.1.7-200.fc22.x86_64

Hardware i3-2100 CPU/GPU

This bug has been going on for a very long time, but at least the computer no longer hard-freezes. GNOME still crashes, though, so any running GTK application that is in the middle of doing something stalls.

Comment 233 Chris Wilson 2015-09-28 08:04:12 UTC

*** Bug 92118 has been marked as a duplicate of this bug. ***

Comment 234 Chris Wilson 2015-10-31 09:34:20 UTC

*** Bug 92739 has been marked as a duplicate of this bug. ***

Comment 235 arrowsmith 2015-11-02 03:04:50 UTC

FWIW, my issue (https://bugs.freedesktop.org/show_bug.cgi?id=54226#c191), was resolved by uninstalling various components, re-installing and updating them. I have a hunch (completely unproven) that it was a transparent bit-fail issue from the SSD. By un-installing and re-installing, the files were likely installed to a different location on the drive. It wasn't configuration, as I tried erasing, and even rolling back to defaults, with the problem still persisting. As it was almost daily, prior to uninstall, and hasn't happened since the install, this is all I can attribute it to.

HTH someone.

Comment 236 Jeffrey E. Bedard 2015-11-06 06:43:54 UTC

Created attachment 119432 [details]
attachment-28908-0.html

I reported this bug from a system without an SSD. Recently, however, I have not
seen the kernel messages appear; I am currently on Linux 4.2.5.

On Sun, Nov 1, 2015 at 10:04 PM, <bugzilla-daemon@freedesktop.org> wrote:

> *Comment # 235 <https://bugs.freedesktop.org/show_bug.cgi?id=54226#c235>
> on bug 54226 <https://bugs.freedesktop.org/show_bug.cgi?id=54226> from
> arrowsmith@pythian.com <arrowsmith@pythian.com> *
>
> FWIW, my issue (https://bugs.freedesktop.org/show_bug.cgi?id=54226#c191), was
> resolved by uninstalling various components, re-installing and updating them. I
> have a hunch (completely unproven) that it was a transparent bit-fail issue
> from the SSD. By un-installing and re-installing, the files were likely
> installed to a different location on the drive. It wasn't configuration, as I
> tried erasing, and even rolling back to defaults, with the problem still
> persisting. As it was almost daily, prior to uninstall, and hasn't happened
> since the install, this is all I can attribute it to.
>
> HTH someone.

Comment 237 arrowsmith 2015-11-06 06:47:37 UTC

(In reply to Jeffrey E. Bedard from comment #236)
> Created attachment 119432 [details]
> attachment-28908-0.html
> 
> I reported this bug from a system without an SSD.  Recently, I have not
> seen the kernel messages appear however--currently on linux 4.2.5.

Ah, let me clarify that earlier comment: I dd'd a failing spinning drive to an SSD. There was lots of clicking. Upgraded packages as they came in, but no change. Only the uninstall and re-install cleared the repeat button. :)

Comment 238 Jeffrey E. Bedard 2015-11-06 06:56:33 UTC

Created attachment 119433 [details]
attachment-32271-0.html

I think this bug can be marked as closed with the latest linux/mesa/xorg
versions :)

On Fri, Nov 6, 2015 at 1:47 AM, <bugzilla-daemon@freedesktop.org> wrote:

> *Comment # 237 <https://bugs.freedesktop.org/show_bug.cgi?id=54226#c237>
> on bug 54226 <https://bugs.freedesktop.org/show_bug.cgi?id=54226> from
> arrowsmith@pythian.com <arrowsmith@pythian.com> *
>
> (In reply to Jeffrey E. Bedard from comment #236)
> > Created attachment 119432 [details]
> > attachment-28908-0.html
> >
> > I reported this bug from a system without an SSD.  Recently, I have not
> > seen the kernel messages appear however--currently on linux 4.2.5.
>
> Ah, let me clarify that earlier comment: I dd'd a failing spinning drive to an
> SSD. There was lots of clicking. Upgraded packages as they came in, but no
> change. Only the uninstall and re-install cleared the repeat button. :)

Comment 239 Chris Wilson 2015-11-12 21:28:01 UTC

*** Bug 92927 has been marked as a duplicate of this bug. ***

Comment 240 Chris Wilson 2015-11-21 16:41:45 UTC

*** Bug 93057 has been marked as a duplicate of this bug. ***

Comment 241 Kurt Roeckx 2015-11-28 12:28:00 UTC

Created attachment 120189 [details]
error state with 4.2 kernel

Comment 242 Chris Wilson 2015-12-10 21:18:24 UTC

*** Bug 93331 has been marked as a duplicate of this bug. ***

Comment 243 Chris Wilson 2015-12-23 09:31:03 UTC

*** Bug 93482 has been marked as a duplicate of this bug. ***

Comment 244 Chris Wilson 2015-12-24 13:21:44 UTC

*** Bug 93493 has been marked as a duplicate of this bug. ***

Comment 245 Chris Wilson 2015-12-30 21:43:32 UTC

*** Bug 89524 has been marked as a duplicate of this bug. ***

Comment 246 Chris Wilson 2016-01-05 20:53:10 UTC

*** Bug 93595 has been marked as a duplicate of this bug. ***

Comment 247 Chris Wilson 2016-01-26 20:45:40 UTC

*** Bug 93876 has been marked as a duplicate of this bug. ***

Comment 248 Chris Wilson 2016-02-19 09:39:17 UTC

*** Bug 93824 has been marked as a duplicate of this bug. ***

Comment 249 Chris Wilson 2016-03-01 20:43:23 UTC

*** Bug 94057 has been marked as a duplicate of this bug. ***

Comment 250 Sander Eikelenboom 2016-03-01 21:27:57 UTC

Tuesday, March 1, 2016, 9:43:23 PM, you wrote:

> Comment # 249 on bug 54226 from Chris Wilson
> *** Bug 94057 has been marked as a duplicate of this bug. ***

Sorry to say, but:
Is there a way to get off the CC list of this slightly depressing "catch-all" bug?
Unfortunately, it doesn't seem to have been going anywhere for the last 3 to 4 years, except
for an endless stream of duplicates being appended.

--
Sander

Comment 251 Jani Nikula 2016-03-02 11:33:16 UTC

(In reply to Sander Eikelenboom from comment #250)
> Is there a way to get off the CC-list of this slightly depressing kind of
> "catch-all" bug ?

CC list is at the top right corner. Choose the address, tick "Remove selected CCs", and hit Save Changes.

I've done this for you now.

Comment 252 Chris Wilson 2016-05-02 12:10:27 UTC

*** Bug 95238 has been marked as a duplicate of this bug. ***

Comment 253 Samantha McVey 2016-06-24 06:45:33 UTC

Chris, I seem to be experiencing this bug in Linux 4.7-rc3 on an X220 ThinkPad with Intel HD 3000 graphics. I was getting random full system freezes, with the machine unresponsive over the network.

The main messages before the crash were:
Jun 23 19:11:18 athena kernel: [drm:fw_domains_get [i915]] *ERROR* render: timed out waiting for forcewake ack request.
Jun 23 19:11:18 athena kernel: [drm:__gen6_gt_wait_for_thread_c0.isra.7 [i915]] *ERROR* GT thread status wait timed out.

I haven't been able to reproduce the original crash easily, but I CAN reproduce a full system lockup every time by running the following intel-gpu-tools tests (I have not come even close to running all the tests, though). [**This may or may not be related to the original crash**]

gem_sync, subtest: bsd2-hang
drv_hangman, subtest: error-state-capture-bit

I do not know if these tests are helpful or related (maybe some are known to fail? I'm not sure).
I had DRM debugging turned on when I ran those tests (drm.debug=0x1e log_buf_len=1M).
I can post logs of the hangs associated with the two tests/subtests and run any other tests if you desire (with kernel DRM debug on). I will wait for the issue to reappear with DRM debug on before posting that log, though. Given the number of similar bugs, you may already have the call trace and non-debug-level logs.

I know how to apply patches and am able to compile kernels, including drm-next or any patches you have, to see if this issue can be isolated. The bug affects me maybe once every 1 or 2 days. I use Xorg with Glamor. I have been seeing these crashes since 4.6 (maybe 4.5 or earlier, I'm not sure).

Thanks, and sorry for the long response.

Comment 254 Chris Wilson 2016-08-11 17:51:22 UTC

*** Bug 97304 has been marked as a duplicate of this bug. ***

Comment 255 Chris Wilson 2016-08-23 14:00:54 UTC

*** Bug 97451 has been marked as a duplicate of this bug. ***

Comment 256 yann 2016-10-17 15:03:36 UTC

*** Bug 98294 has been marked as a duplicate of this bug. ***

Comment 257 Chris Wilson 2016-11-21 12:22:53 UTC

*** Bug 98807 has been marked as a duplicate of this bug. ***

Comment 258 Chris Wilson 2017-03-17 09:13:19 UTC

*** Bug 100245 has been marked as a duplicate of this bug. ***

Comment 259 Ricardo 2017-05-09 17:21:33 UTC

Adding tag to the "Whiteboard" field: ReadyForDev
The bug is still active
*Status is correct
*Platform is included
*Feature is included
*Priority and Severity correctly set
*Logs included

Comment 260 samuel.rakitnican 2017-07-14 14:34:15 UTC

I don't seem to be getting the mentioned GNOME crashes on my Sandy Bridge anymore with mainline kernels. I am currently on 4.11, and I think I was not getting any issues even with 4.10. With mainline longterm 4.4.61 and the default CentOS 7 kernels, I am definitely getting very frequent GPU crashes that bring down GNOME.

So it is either fixed for good, or it has become much rarer. The issue I am/was experiencing happens when GNOME is running; it does not happen when only GDM is loaded. System load does not seem to affect whether the bug triggers: it seems to happen at any time, whether the machine is idle or loaded.

Comment 261 Elizabeth 2017-07-31 21:05:42 UTC

(In reply to samuel.rakitnican from comment #260)
> I doesn't seem to be getting mentioned Gnome crashes on my sandybridge
> anymore with mainline kernels, that is currently 4.11 and I think even with
> 4.10 I was not getting any issues, with mainline longterm 4.4.61 and default
> centos 7 kernels I am definitely getting very frequent GPU crashes that
> brings down Gnome.
> 
> So it is either fixed for good, or it become much rarer. The issue I am/was
> experiencing happens when Gnome is running, it does not happen when only GDM
> is loaded. System load seems to not have effect on the bug triggering, seems
> to happen any time, on idle, or when machine is loaded.
Hopefully it is fixed for good. I'm closing this bug; if the problem arises with the latest kernel versions (https://www.kernel.org/), please open a NEW bug with HW and SW information, steps to reproduce, and relevant logs. Thank you.

Comment 262 Chris Wilson 2017-10-28 10:38:24 UTC

(In reply to Elizabeth from comment #261)
> (In reply to samuel.rakitnican from comment #260)
> > I doesn't seem to be getting mentioned Gnome crashes on my sandybridge
> > anymore with mainline kernels, that is currently 4.11 and I think even with
> > 4.10 I was not getting any issues, with mainline longterm 4.4.61 and default
> > centos 7 kernels I am definitely getting very frequent GPU crashes that
> > brings down Gnome.
> > 
> > So it is either fixed for good, or it become much rarer. The issue I am/was
> > experiencing happens when Gnome is running, it does not happen when only GDM
> > is loaded. System load seems to not have effect on the bug triggering, seems
> > to happen any time, on idle, or when machine is loaded.
> Hopefully, is fixed for good. I'm closing this bug, if problem arise with
> latest kernel versions https://www.kernel.org/ please open a NEW bug with HW
> and SW information, steps to reproduce and relevant logs.Thank you.

There was no fix for this HW issue.

Comment 263 Aaron Lu 2017-10-31 01:33:50 UTC

Created attachment 135173 [details]
gpu error file on 4.13.5-200.fc26.x86_64

This problem reappeared on 4.13.5-200.fc26.x86_64 last Friday.

[774249.632109] [drm] GPU HANG: ecode 6:0:0x85fffff8, in Xorg [696], reason: Hang on rcs0, action: reset
[774249.632110] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[774249.632111] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[774249.632111] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[774249.632111] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[774249.632112] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[774249.632172] drm/i915: Resetting chip after gpu hang

Comment 264 Chris Wilson 2017-11-20 22:18:02 UTC

commit 0da715ee60774401bea00dc71fca6fd1096c734a
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Nov 20 20:55:02 2017 +0000

    drm/i915: Disable semaphores on Sandybridge

Comment 265 Chris Wilson 2017-12-13 14:30:13 UTC

*** Bug 104243 has been marked as a duplicate of this bug. ***

Comment 266 Chris Wilson 2017-12-17 14:03:27 UTC

*** Bug 104304 has been marked as a duplicate of this bug. ***

Comment 267 Chris Wilson 2018-01-24 16:39:43 UTC

*** Bug 104772 has been marked as a duplicate of this bug. ***

Comment 268 Jani Saarinen 2018-03-28 15:53:58 UTC

I will close this now.

Comment 269 Chris Wilson 2018-04-18 07:59:11 UTC

*** Bug 106119 has been marked as a duplicate of this bug. ***
