Bug 99671 - [v4.10 snb] weird seqno/request tracking - HEAD overtakes TAIL
Status: CLOSED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Duplicates: 100110 100454 102120 103407
Depends on:
Blocks:
 
Reported: 2017-02-04 09:03 UTC by Josh Holland
Modified: 2018-04-25 06:38 UTC

See Also:
i915 platform: SNB
i915 features: GEM/Other


Attachments
GPU crash dump (2.19 MB, text/plain)
2017-02-04 09:03 UTC, Josh Holland
dmesg (71.81 KB, text/plain)
2017-02-04 09:04 UTC, Josh Holland
dmesg from 4.10.0-994 kernel (56.67 KB, text/plain)
2017-02-09 21:24 UTC, Josh Holland
GPU crash dump from 4.10.0-994 kernel (21.79 KB, text/plain)
2017-02-09 21:25 UTC, Josh Holland
dmesg from drm-tip fb21519ea (56.97 KB, text/plain)
2017-02-10 22:16 UTC, Josh Holland
GPU crash dump from drm-tip fb21519ea (24.13 KB, text/plain)
2017-02-10 22:17 UTC, Josh Holland
dmesg from fb21519ea with patch from #19 (143.26 KB, text/plain)
2017-02-11 19:08 UTC, Josh Holland
dmesg from 58294e406 with patch from #19 (64.68 KB, text/plain)
2017-02-13 21:22 UTC, Josh Holland
GPU crash dump from 58294e406 with patch from #19 (30.73 KB, text/plain)
2017-02-13 21:23 UTC, Josh Holland
dmesg from 1d7915e78 with patches #19 and #30 (57.79 KB, text/plain)
2017-03-03 21:18 UTC, Josh Holland
GPU dump from 1d7915e78 with patches #19 and #30 (29.00 KB, text/plain)
2017-03-03 21:18 UTC, Josh Holland
dmesg from ec496685b, patches 19, 30, 34 (60.24 KB, text/plain)
2017-03-05 11:26 UTC, Josh Holland
GPU dump from ec496685b, patches 19, 30, 34 (28.58 KB, text/plain)
2017-03-05 11:27 UTC, Josh Holland
Test to see if TAIL writes go backwards (1.12 KB, patch)
2017-03-18 17:30 UTC, Chris Wilson
dmesg from d8839e27a (76.43 KB, text/plain)
2017-03-20 20:37 UTC, Josh Holland
GPU dump from d8839e27a (28.12 KB, text/plain)
2017-03-20 20:37 UTC, Josh Holland
dmesg from d6a919d39 (72.36 KB, text/plain)
2017-05-24 14:25 UTC, Josh Holland
GPU dump from d6a919d39 (44.15 KB, text/plain)
2017-05-24 14:25 UTC, Josh Holland
dmesg (Ubuntu kernel 4.4.0-79, Mesa 17.1.2) (152.78 KB, text/plain)
2017-06-27 15:51 UTC, Josh Holland
GPU dump (Ubuntu kernel 4.4.0-79, Mesa 17.1.2) (3.51 MB, text/plain)
2017-06-27 15:52 UTC, Josh Holland

Description Josh Holland 2017-02-04 09:03:34 UTC
Created attachment 129333 [details]
GPU crash dump

Randomly when waking from suspend, I get graphical issues, with this in dmesg:

[36834.014792] [drm] stuck on render ring
[36834.015296] [drm] GPU HANG: ecode 6:0:0xbd69ffff, in compiz [2609], reason: Ring hung, action: reset
[36834.015299] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[36834.015300] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[36834.015301] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[36834.015302] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[36834.015303] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[36834.017384] drm/i915: Resetting chip after gpu hang
[36840.026991] [drm] stuck on render ring
[36840.027509] [drm] GPU HANG: ecode 6:0:0xfeffffff, in compiz [2609], reason: Ring hung, action: reset
[36840.029611] drm/i915: Resetting chip after gpu hang

The graphical glitches vary; mostly, all my windows just get moved to one workspace. This time, large amounts of text in some applications disappeared -- very little text is visible in gnome-system-monitor and nautilus (even when closed and reopened), but Chrome and gnome-terminal aren't affected.

I'm running Ubuntu 16.04, Linux yes 4.4.0-62-generic #83-Ubuntu SMP Wed Jan 18 14:10:15 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux, on a Lenovo G580 laptop.
Comment 1 Josh Holland 2017-02-04 09:04:09 UTC
Created attachment 129334 [details]
dmesg
Comment 2 Chris Wilson 2017-02-04 09:33:19 UTC
The request list doesn't match hardware state -- please try with a later kernel, though a fix for a problem you may be hitting hasn't landed in upstream yet, so please try drm-tip.
Comment 3 Josh Holland 2017-02-09 15:50:09 UTC
Ok, I'm installing the 2017-02-09 build from http://kernel.ubuntu.com/~kernel-ppa/mainline/drm-tip/current/. I got some warnings on installation:

W: Possible missing firmware /lib/firmware/i915/kbl_dmc_ver1_01.bin for module i915
W: Possible missing firmware /lib/firmware/i915/glk_dmc_ver1_01.bin for module i915
W: Possible missing firmware /lib/firmware/i915/kbl_guc_ver9_14.bin for module i915
W: Possible missing firmware /lib/firmware/i915/bxt_guc_ver8_7.bin for module i915
W: Possible missing firmware /lib/firmware/i915/kbl_huc_ver02_00_1810.bin for module i915
W: Possible missing firmware /lib/firmware/i915/bxt_huc_ver01_07_1398.bin for module i915
W: Possible missing firmware /lib/firmware/i915/skl_huc_ver01_07_1398.bin for module i915

but I have no idea whether they're relevant or useful.
Comment 4 Josh Holland 2017-02-09 16:19:54 UTC
The 4.10.0-994-generic kernel seems initially to work very well; I can't see any errors in dmesg about i915, unlike with Ubuntu's 4.4.0 kernels, and the TTYs actually display text, which they haven't for a while.

If this issue is still present with this kernel, it's likely to be a few days/weeks before I run into it, so please don't close this bug yet.
Comment 5 Josh Holland 2017-02-09 21:23:24 UTC
Looks like I spoke too soon -- I just got a crash, much worse than previous ones. I was actively using the machine, not suspending/unsuspending it, and the GUI totally froze except the mouse. Switching to a TTY worked though, and I captured dmesg and a GPU crash dump; uploading those here.
Comment 6 Josh Holland 2017-02-09 21:24:42 UTC
Created attachment 129457 [details]
dmesg from 4.10.0-994 kernel
Comment 7 Josh Holland 2017-02-09 21:25:14 UTC
Created attachment 129458 [details]
GPU crash dump from 4.10.0-994 kernel
Comment 8 Chris Wilson 2017-02-09 21:43:56 UTC
Baffling. Looks like the same issue, requests are being retired before their seqno is complete and objects reused before they are idle. That should not be possible!
Comment 9 Chris Wilson 2017-02-09 23:20:05 UTC
It's fair to say that progress at this point will mean you compiling your own kernel from https://cgit.freedesktop.org/drm-tip and then we start trying debug patches.
Comment 10 Josh Holland 2017-02-10 13:26:06 UTC
Ok, I can probably do that. Having never built the kernel before, where do you suggest I get the config from -- the Ubuntu 4.10 drm-tip build?
Comment 11 Chris Wilson 2017-02-10 14:07:00 UTC
Yes, cp /boot/config-`uname -r` .config is a good starting point - it will then at least boot :)
Comment 12 Josh Holland 2017-02-10 19:28:03 UTC
I've built and booted a kernel from drm-tip (revision fb21519ea). Unfortunately, the initrd is huge (almost as big as /boot) which makes it a little hard to deal with. Will using INSTALL_MOD_STRIP=1 make debugging harder?
Comment 13 Chris Wilson 2017-02-10 19:42:54 UTC
I think something is wrong with the mkinitramfs script then, I believe it should (or at least can?) only include the modules required for booting.

Cross your fingers and try make localmodconfig. It only very rarely fails to include a module actually used for booting - usually because the delta between the base distro config and your own is too great. But in this case you have a v4.10 config so should be fine.
Comment 14 Josh Holland 2017-02-10 21:08:13 UTC
make localmodconfig worked fine and reduced the size of initrd to about a tenth of its original size, thanks for mentioning it! Seems like I now have a working kernel from source.
Comment 15 Chris Wilson 2017-02-10 21:17:54 UTC
Cool, the next step is to wait for a hang and attach it. That's just to be sure we are still reproducing the same issue with a local build.
Comment 16 Josh Holland 2017-02-10 22:16:27 UTC
I got a hang, uploading logs.
Comment 17 Josh Holland 2017-02-10 22:16:57 UTC
Created attachment 129493 [details]
dmesg from drm-tip fb21519ea
Comment 18 Josh Holland 2017-02-10 22:17:21 UTC
Created attachment 129494 [details]
GPU crash dump from drm-tip fb21519ea
Comment 19 Chris Wilson 2017-02-10 22:31:54 UTC
Hmm, still the same :|

Apply this patch:

diff --git a/drivers/gpu/drm/i915/i915_gem.h b/drivers/gpu/drm/i915/i915_gem.h
index 5a49487368ca..998e3780f2c6 100644
--- a/drivers/gpu/drm/i915/i915_gem.h
+++ b/drivers/gpu/drm/i915/i915_gem.h
@@ -26,7 +26,7 @@
 #define __I915_GEM_H__
 
 #ifdef CONFIG_DRM_I915_DEBUG_GEM
-#define GEM_BUG_ON(expr) BUG_ON(expr)
+#define GEM_BUG_ON(expr) WARN_ON(expr)
 #define GEM_WARN_ON(expr) WARN_ON(expr)
 
 #define GEM_DEBUG_DECL(var) var

and please recompile with CONFIG_DRM_I915_DEBUG_GEM enabled. If you are using make menuconfig, look under Device Drivers / Graphics / i915 debugging options. You basically need to enable all options there, which also requires setting CONFIG_EXPERT under General Settings.
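
For anyone unsure what the one-line change above buys: BUG_ON stops the kernel at the first failed check, while WARN_ON logs the failure and carries on, so more than one tripped assertion can show up in dmesg. A minimal userspace sketch of that "warn and continue" behaviour follows. It is only an illustration, not the kernel macro; sketch_warn_on and SKETCH_WARN_ON are hypothetical names.

```c
#include <stdio.h>

/*
 * Hypothetical userspace stand-in for a WARN_ON-style check: report the
 * failed condition on stderr, keep running, and return the condition so
 * the caller can still branch on it.
 */
static inline int sketch_warn_on(int cond, const char *expr,
                                 const char *file, int line)
{
    if (cond)
        fprintf(stderr, "WARNING: %s at %s:%d\n", expr, file, line);
    return cond;
}

/* Stringify the expression so the report names what failed. */
#define SKETCH_WARN_ON(expr) \
    sketch_warn_on(!!(expr), #expr, __FILE__, __LINE__)
```

Writing `if (SKETCH_WARN_ON(tail > size)) return -1;` would log the problem and let the caller bail out gracefully instead of halting.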
Comment 20 Josh Holland 2017-02-11 19:08:00 UTC
I got a crash, although I couldn't get the GPU crash dump unfortunately -- I couldn't even switch to a TTY. dmesg looks like it might be more useful now though.

Should I be staying on revision fb21519ea for this, or should I be upgrading to the latest drm-tip?
Comment 21 Josh Holland 2017-02-11 19:08:39 UTC
Created attachment 129517 [details]
dmesg from fb21519ea with patch from #19
Comment 22 Chris Wilson 2017-02-11 19:23:57 UTC
(In reply to Josh Holland from comment #20)
> I got a crash, although I couldn't get the GPU crash dump unfortunately -- I
> couldn't even switch to a TTY. dmesg looks like it might be more useful now
> though.

Hmm, the crash comes after the hang though, and it is consistent with the corruption. But it does make the matter much more serious as we go from a gpu hang to a driver lockup. I was expecting the debug code to detect something much, much earlier.

Hmm. Or maybe it is more significant than I first thought.

Feb 11 17:47:37 yes kernel: [    2.358719] [drm] DRM_I915_DEBUG_GEM enabled
confirms that the debug code was indeed enabled.

> Should I be staying on revision fb21519ea for this, or should I be upgrading
> to the latest drm-tip?

Nothing pertinent to this bug yet, but refreshing every time we have an idea to test will be helpful (random bug fixes hopefully improving and not adding regressions elsewhere!).
Comment 23 Chris Wilson 2017-02-13 11:24:13 UTC
commit fe3288b5da2c1286a7aac1fb1b2234caa752a81b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Feb 12 17:20:01 2017 +0000

    drm/i915: Park the breadcrumbs signaler across a GPU reset

should fix the bug on hit in comment 19. Could you update drm-tip (git stash; git reset --hard <drm-tip>; git stash apply) and see what pops out of the woodwork this time?

* still hoping for a nice WARN or a sensible error-state!
Comment 24 Josh Holland 2017-02-13 16:56:22 UTC
Ok, now building from 58294e406 with the patch from comment 19.
Comment 25 Josh Holland 2017-02-13 21:21:31 UTC
And another hang. This one I got dmesg and GPU dump from; it seemed more like the hangs I was getting on the Ubuntu 4.4 kernels (where everything apart from the cursor freezes for a minute, but then starts working fine again), rather than the 4.10 ones I've been building, where the entire UI (including cursor) freezes and doesn't seem to recover. Presumably your commit fixed something...
Comment 26 Josh Holland 2017-02-13 21:22:16 UTC
Created attachment 129567 [details]
dmesg from 58294e406 with patch from #19
Comment 27 Josh Holland 2017-02-13 21:23:04 UTC
Created attachment 129568 [details]
GPU crash dump from 58294e406 with patch from #19
Comment 28 Chris Wilson 2017-02-13 21:35:59 UTC
(In reply to Josh Holland from comment #25)
> And another hang. This one I got dmesg and GPU dump from; it seemed more
> like the hangs I was getting on the Ubuntu 4.4 kernels (where everything
> apart from the cursor freezes for a minute, but then starts working fine
> again), rather than the 4.10 ones I've been building, where the entire UI
> (including cursor) freezes and doesn't seem to recover. Presumably your
> commit fixed something...

Yup, we are right back to the original pattern of hangs. But now I know it passed internal sanity checks before doing so. Puzzling. 

I need to think about how this could even arise, in the meantime could you please try running with i915.semaphores=0 on the kernel command line and see if that makes a difference?
Comment 29 Josh Holland 2017-02-14 17:37:18 UTC
Booted with i915.semaphores=0 (still on drm-tip 58294e406 with patch from #19), then got a 4.10-like hang where even Magic SysRq stopped working. I can upload the whole dmesg if you want, but I assume this is the relevant part, ending with the final message (the previous message was half an hour ago when bamfdaemon died, which it does occasionally):

Feb 14 16:51:03 yes kernel: [ 3435.777321] [drm] GPU HANG: ecode 6:0:0xbde7ffff, in compiz [3786], reason: Hang on render ring, action: reset
Feb 14 16:51:03 yes kernel: [ 3435.777323] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
-- ...how to report GPU hangs... --
Feb 14 16:51:03 yes kernel: [ 3435.777324] [drm] GPU crash dump saved to /sys/class/drm/card0/error
Feb 14 16:51:03 yes kernel: [ 3435.777375] drm/i915: Resetting chip after gpu hang
Feb 14 16:51:06 yes kernel: [ 3438.780562] asynchronous wait on fence i915:[global]:3b94e timed out
Feb 14 16:51:11 yes kernel: [ 3443.772358] drm/i915: Resetting chip after gpu hang
Feb 14 16:51:23 yes kernel: [ 3455.931636] [drm:drm_atomic_helper_swap_state [drm_kms_helper]] *ERROR* [CRTC:31:pipe A] hw_done timed out
Feb 14 16:51:24 yes kernel: [ 3456.955546] asynchronous wait on fence i915:[global]:3b954 timed out
Feb 14 16:51:34 yes kernel: [ 3466.171061] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:31:pipe A] hw_done timed out
Feb 14 16:51:35 yes kernel: [ 3467.195005] [drm:drm_atomic_helper_wait_for_dependencies [drm_kms_helper]] *ERROR* [CRTC:31:pipe A] hw_done timed out

Unfortunately no GPU dump, due to the aforementioned lack of response to even an alt-SysRq-B, let alone ctrl-alt-F1.
Comment 30 Chris Wilson 2017-03-01 12:27:37 UTC
Can you try:

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 4ffa35faff49..df094699ba9d 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1217,10 +1217,6 @@ static int init_status_page(struct intel_engine_cs *engine)
                return PTR_ERR(obj);
        }
 
-       ret = i915_gem_object_set_cache_level(obj, I915_CACHE_LLC);
-       if (ret)
-               goto err;
-
        vma = i915_vma_instance(obj, &engine->i915->ggtt.base, NULL);
        if (IS_ERR(vma)) {
                ret = PTR_ERR(vma);
@@ -1244,7 +1240,7 @@ static int init_status_page(struct intel_engine_cs *engine)
        if (ret)
                goto err;
 
-       vaddr = i915_gem_object_pin_map(obj, I915_MAP_WB);
+       vaddr = i915_gem_object_pin_map(obj, I915_MAP_WC);
        if (IS_ERR(vaddr)) {
                ret = PTR_ERR(vaddr);
                goto err_unpin;
Comment 31 Josh Holland 2017-03-03 21:17:03 UTC
Ok, built drm-tip 1d7915e78 with patches from #19 and #30, then got several hangs in a row. Attaching the last dmesg and GPU dump; dmesg is longer, and the GPU dump is barely different (only timestamps changed between hangs).

Is there anything useful I can be giving you besides dmesg and /sys/class/drm/card0/error?
Comment 32 Josh Holland 2017-03-03 21:18:26 UTC
Created attachment 130055 [details]
dmesg from 1d7915e78 with patches #19 and #30
Comment 33 Josh Holland 2017-03-03 21:18:51 UTC
Created attachment 130056 [details]
GPU dump from 1d7915e78 with patches #19 and #30
Comment 34 Chris Wilson 2017-03-03 21:56:37 UTC
dmesg & error are just what I need. (If I need anything else, the goal is to add it to the error state.)

Onto the next theory, a few more asserts:

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 4ffa35f..5a7c140 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -44,6 +44,7 @@ static int __intel_ring_space(int head, int tail, int size)
        int space = head - tail;
        if (space <= 0)
                space += size;
+       GEM_BUG_ON(space <= I915_RING_FREE_SPACE);
        return space - I915_RING_FREE_SPACE;
 }
 
@@ -1682,6 +1683,8 @@ u32 *intel_ring_begin(struct drm_i915_gem_request *req, int num_dwords)
                wait_bytes = total_bytes;
        }
 
+       GEM_BUG_ON(ring->space > __intel_ring_space(ring->head & HEAD_ADDR,
+                                                   ring->tail, ring->size));
        if (wait_bytes > ring->space) {
                int ret = wait_for_space(req, wait_bytes);
                if (unlikely(ret))
@@ -1698,6 +1701,7 @@ u32 *intel_ring_begin(struct drm_i915_gem_request *req, int num_dwords)
                ring->space -= remain_actual;
        }
 
+       GEM_BUG_ON(bytes > ring->space);
        GEM_BUG_ON(ring->tail > ring->size - bytes);
        cs = ring->vaddr + ring->tail;
        ring->tail += bytes;
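
For readers following the asserts, here is a minimal userspace sketch of the free-space arithmetic that __intel_ring_space() performs in the hunk above. The function body mirrors the three lines visible in the diff; the name sketch_ring_space and the reserve value of 64 bytes are assumptions made for illustration, standing in for I915_RING_FREE_SPACE.

```c
/* Assumed reserve kept free between TAIL and HEAD (illustrative value). */
#define SKETCH_RING_FREE_SPACE 64

/*
 * Bytes the CPU may still write: the gap from tail forward to head,
 * wrapped around the ring, minus the reserve. head == tail is treated
 * as an empty ring, so the full size (less the reserve) is available.
 */
static inline int sketch_ring_space(int head, int tail, int size)
{
    int space = head - tail;

    if (space <= 0)
        space += size;
    return space - SKETCH_RING_FREE_SPACE;
}
```

With a 4096-byte ring, head=100 and tail=64 leave a raw gap of only 36 bytes, so the result is negative (-28). That is the situation the first added GEM_BUG_ON is meant to catch.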
Comment 35 Josh Holland 2017-03-05 11:26:48 UTC
Built from ec496685b with patches #19, #30 and #34, and got a hang where Compiz exploded.

Is something munging the tabs in your patches, by the way? I've had to apply the last two by hand.
Comment 36 Josh Holland 2017-03-05 11:26:56 UTC
Created attachment 130066 [details]
dmesg from ec496685b, patches 19, 30, 34
Comment 37 Josh Holland 2017-03-05 11:27:01 UTC
Created attachment 130067 [details]
GPU dump from ec496685b, patches 19, 30, 34
Comment 38 Chris Wilson 2017-03-05 13:55:19 UTC
(In reply to Josh Holland from comment #35)
> Is something munging the tabs in your patches, by the way? I've had to apply
> the last two by hand.

Lazily pasting from a terminal that likes to expand tabs into the clipboard.

The pattern is that the ring is running past the TAIL. Now I need to find an explanation that doesn't involve the hw playing games with us.
Comment 39 Chris Wilson 2017-03-06 11:15:51 UTC
Please try (pardon the tabs):


diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 4ffa35faff49..62e31a7438ac 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -782,10 +782,10 @@ static void i9xx_submit_request(struct drm_i915_gem_request *request)
 {
        struct drm_i915_private *dev_priv = request->i915;
 
-       i915_gem_request_submit(request);
-
        GEM_BUG_ON(!IS_ALIGNED(request->tail, 8));
        I915_WRITE_TAIL(request->engine, request->tail);
+
+       i915_gem_request_submit(request);
 }
 
 static void i9xx_emit_breadcrumb(struct drm_i915_gem_request *req, u32 *cs)
Comment 40 yann 2017-03-06 14:36:37 UTC
Reference to Chris' patch: https://patchwork.freedesktop.org/series/20757/
Comment 41 Chris Wilson 2017-03-06 14:42:37 UTC
The patch is bogus unfortunately. Still trying to find an explanation.
Comment 42 Chris Wilson 2017-03-08 12:28:29 UTC
*** Bug 100110 has been marked as a duplicate of this bug. ***
Comment 43 Chris Wilson 2017-03-17 20:26:29 UTC
Just documenting my continual failure here. Having run with DEBUG_GEM enabled, you also showed that this assert doesn't fire:

void __i915_gem_request_submit(struct drm_i915_gem_request *request)
{
        GEM_BUG_ON(i915_seqno_passed(intel_engine_get_seqno(engine), seqno));
}

which is the scenario I was worrying about in comment 39. I still need to write a test to see if the tail write goes backwards.
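
As background on that assert, a wrap-safe "has seqno A reached seqno B" test is usually written as a signed difference of unsigned counters. The sketch below shows that idiom in userspace under that assumption; sketch_seqno_passed is a hypothetical name, not a claim about the driver's exact code.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if seq1 is at or beyond seq2, treating the 32-bit counter as
 * circular: the unsigned subtraction wraps, and reinterpreting the
 * result as signed gives the shortest distance between the two values.
 */
static inline bool sketch_seqno_passed(uint32_t seq1, uint32_t seq2)
{
    return (int32_t)(seq1 - seq2) >= 0;
}
```

The comparison stays correct across the 0xffffffff -> 0 rollover as long as the two values are less than 2^31 apart.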
Comment 44 Chris Wilson 2017-03-18 17:29:07 UTC
Test to see if we ever write requests out-of-order:

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index be908e2a52ea..da610ce176a9 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -784,6 +784,16 @@ static void i9xx_submit_request(struct drm_i915_gem_request *request)
 
        i915_gem_request_submit(request);
 
+       {
+               u32 head = I915_READ_HEAD(request->engine) & HEAD_ADDR;
+               u32 tail = I915_READ_TAIL(request->engine) & HEAD_ADDR;
+               int prev = __intel_ring_space(tail, head, request->ring->size);
+               int next = __intel_ring_space(request->tail, head, request->ring->size);
+               WARN(head != tail && next <= prev,
+                    "Bacwards we go: head=%x, tail=%x, next=%x\n",
+                    head, tail, request->tail);
+       }
+
        GEM_BUG_ON(!IS_ALIGNED(request->tail, 8));
        I915_WRITE_TAIL(request->engine, request->tail);
 }
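
The check in this hunk boils down to comparing how far the old and the new TAIL sit ahead of HEAD. A minimal userspace sketch of that comparison, with hypothetical helper names and plain ints in place of register reads:

```c
#include <stdbool.h>

/*
 * Distance from head forward to tail around the ring. As in the patch,
 * a difference of zero or less is wrapped by adding the ring size.
 */
static inline int sketch_ring_distance(int head, int tail, int size)
{
    int d = tail - head;

    if (d <= 0)
        d += size;
    return d;
}

/*
 * True if writing new_tail would not move TAIL further ahead of HEAD
 * than old_tail already is. An idle ring (head == old_tail) is exempt.
 */
static inline bool sketch_tail_goes_backwards(int head, int old_tail,
                                              int new_tail, int size)
{
    if (head == old_tail)
        return false;
    return sketch_ring_distance(head, new_tail, size) <=
           sketch_ring_distance(head, old_tail, size);
}
```

The reserve subtracted inside __intel_ring_space() appears on both sides of the comparison in the patch, so the sketch leaves it out.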
Comment 45 Chris Wilson 2017-03-18 17:30:18 UTC
Created attachment 130303 [details] [review]
Test to see if TAIL writes go backwards
Comment 46 Martin Steigerwald 2017-03-18 20:01:31 UTC
Chris, thank you for working to stabilize snb. I appreciate it.
Comment 47 Josh Holland 2017-03-20 20:36:25 UTC
Built from d8839e27a with all patches except comment 39, got a hang.
Comment 48 Josh Holland 2017-03-20 20:37:08 UTC
Created attachment 130328 [details]
dmesg from d8839e27a
Comment 49 Josh Holland 2017-03-20 20:37:28 UTC
Created attachment 130329 [details]
GPU dump from d8839e27a
Comment 50 Chris Wilson 2017-03-20 21:08:46 UTC
(In reply to Josh Holland from comment #49)
> Created attachment 130329 [details]
> GPU dump from d8839e27a

The HEAD is past the TAIL, but the seqno is old. The batch buffer address is consistent with the instructions in the ring at HEAD. The oddity this time is that the seqnos do not match up with the supposed execution through the ring.
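
To make "HEAD is past the TAIL" concrete for a circular buffer: HEAD should never be further from a known earlier position than TAIL is. The sketch below expresses that, assuming a power-of-two ring size; the function name and the `start` reference point (for example, a previously sampled HEAD) are hypothetical and not taken from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * True if head lies in the window [start, tail] measured forward around
 * the ring. size must be a power of two so that masking implements the
 * wrap. A false result means HEAD has overtaken TAIL.
 */
static inline bool sketch_head_within_tail(uint32_t start, uint32_t head,
                                           uint32_t tail, uint32_t size)
{
    uint32_t mask = size - 1;

    return ((head - start) & mask) <= ((tail - start) & mask);
}
```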
Comment 51 Chris Wilson 2017-03-21 14:25:40 UTC
Very minor, but

commit fe085f13c7901203445fd2ab26c0f499313b8258
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Mar 21 10:25:52 2017 +0000

    drm/i915: Remove intel_ring.last_retired_head

may help clarify the expected values in the error state.
Comment 52 Chris Wilson 2017-03-29 14:08:55 UTC
Hmm, just seen some similar symptoms in bug 100484 where HEAD > TAIL and seqno stopped updating long before. The telltale is that the context was blank and the failure occurred just after it was reloaded.
Comment 53 Chris Wilson 2017-03-29 14:10:18 UTC
(In reply to Chris Wilson from comment #52)
> Hmm, just seen some similar symptoms in bug 100484 where HEAD > TAIL and
> seqno stopped updating long before. The telltale is that the context was
> blank and the failure occurred just after it was reloaded.

Sadly that doesn't appear to be the case here; the contexts seem to have content (so hopefully valid content!)
Comment 54 Chris Wilson 2017-03-29 19:51:01 UTC
*** Bug 100454 has been marked as a duplicate of this bug. ***
Comment 55 Chris Wilson 2017-05-03 10:18:49 UTC
Can you please test with https://patchwork.freedesktop.org/patch/154241/ and GEM debugging enabled? It's a very rare possibility that we may have placed the RING_TAIL on the same cacheline as RING_HEAD.
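
As a rough illustration of the condition being tested, two ring offsets share a cacheline when they fall in the same aligned block. The sketch below assumes 64-byte cachelines; the constant and the function name are hypothetical, and the linked patch may express the check differently.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed cacheline size in bytes (illustrative). */
#define SKETCH_CACHELINE_BYTES 64u

/* True if both byte offsets fall within the same aligned cacheline. */
static inline bool sketch_same_cacheline(uint32_t a, uint32_t b)
{
    return (a / SKETCH_CACHELINE_BYTES) == (b / SKETCH_CACHELINE_BYTES);
}
```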
Comment 56 Josh Holland 2017-05-05 22:17:28 UTC
Will do. It looks like that patch is already in drm-tip fb550f864? I'm also assuming previous patches to drivers/gpu/drm/i915/intel_ringbuffer.c are no longer relevant, since they don't apply on top of current drm-tip.
Comment 57 Josh Holland 2017-05-24 14:24:25 UTC
Got a hang (drm-tip d6a919d39), this time not in Chrome. I had the following DRM-related kernel config (gathered from previous comments and posted here for the next time I accidentally delete .config), hopefully I didn't leave anything important turned off.

#
# drm/i915 Debugging
#
CONFIG_DRM_I915_WERROR=y
CONFIG_DRM_I915_DEBUG=y
CONFIG_DRM_I915_DEBUG_GEM=y
CONFIG_DRM_I915_SW_FENCE_DEBUG_OBJECTS=y
# CONFIG_DRM_I915_SW_FENCE_CHECK_DAG is not set
CONFIG_DRM_I915_SELFTEST=y
# CONFIG_DRM_I915_LOW_LEVEL_TRACEPOINTS is not set
# CONFIG_DRM_I915_DEBUG_VBLANK_EVADE is not set
Comment 58 Josh Holland 2017-05-24 14:25:05 UTC
Created attachment 131468 [details]
dmesg from d6a919d39
Comment 59 Josh Holland 2017-05-24 14:25:34 UTC
Created attachment 131469 [details]
GPU dump from d6a919d39
Comment 60 Chris Wilson 2017-05-24 14:46:05 UTC
(In reply to Josh Holland from comment #57)
> Got a hang (drm-tip d6a919d39), this time not in Chrome. I had the following
> DRM-related kernel config (gathered from previous comments and posted here
> for the next time I accidentally delete .config), hopefully I didn't leave
> anything important turned off.

I think that's actually a genuine userspace hang (in mesa), checking the location of the RING_HEAD is consistent with our expectation and it doesn't seem to have the same stray ACTHD or wacky retirements as earlier. (Still slightly cautious as it took a while to spot the strange behaviour originally, and I may be missing it here.)
Comment 61 Josh Holland 2017-06-17 14:29:40 UTC
I've been running Mesa 17.1 (rather than Ubuntu Xenial's Mesa 12) for nearly three weeks now, and I'm inclined to agree -- I still have graphical issues with stuff flickering, especially in Chrome, and dmesg still has the occasional "Atomic update failure on pipe A", but I haven't had a single GPU hang on Mesa 17 AFAIR.
Comment 62 Josh Holland 2017-06-27 15:50:30 UTC
Scratch that, I switched back to Ubuntu's kernel (4.4.0-79) from drm-tip and I got a hang after a week. Uploading error state for completeness' sake.
Comment 63 Josh Holland 2017-06-27 15:51:52 UTC
Created attachment 132284 [details]
dmesg (Ubuntu kernel 4.4.0-79, Mesa 17.1.2)
Comment 64 Josh Holland 2017-06-27 15:52:44 UTC
Created attachment 132285 [details]
GPU dump (Ubuntu kernel 4.4.0-79, Mesa 17.1.2)
Comment 65 Chris Wilson 2017-08-08 21:10:15 UTC
*** Bug 102120 has been marked as a duplicate of this bug. ***
Comment 66 Chris Wilson 2017-10-23 10:41:03 UTC
*** Bug 103407 has been marked as a duplicate of this bug. ***
Comment 67 dev66 2017-10-27 12:42:29 UTC
Hello,

As indicated above, my bug report 103407 (also a GPU hang, on the latest openSUSE Tumbleweed with Lenovo X201 and X220) was declared a duplicate of this bug.

Reading above, I see that the issue is still open, and I'd like to ask here whether there is anything we can do about it.

TIA
Comment 68 Josh Holland 2018-02-04 02:59:59 UTC
(In reply to dev66 from comment #67)
> Reading above, I see that the issue is still open, and I'd like to ask here
> whether there is anything we can do about it.
> 
> TIA

Unfortunately it seems not; I am also still getting this occasionally.

It does now only seem to happen to me when the machine is under memory pressure/high cpu temperatures, so I wonder if it could be a hardware problem.
Comment 69 Jani Saarinen 2018-03-29 07:11:29 UTC
First of all, sorry about the spam.
This is a mass update for our bugs.

Sorry if you find this annoying, but we are trying to understand whether this bug is still valid.
If the bug investigation is still in progress, please ignore this, and I apologize!

If you think this bug is no longer valid, please comment on the bug that it can be closed.
If you haven't tested with our latest pre-upstream tree (drm-tip), can you do that as well to see whether the issue is still valid there? If you cannot see the issue there, please comment on the bug.
Comment 70 Jani Saarinen 2018-04-25 06:38:21 UTC
Closing; please re-open if the issue still exists.

