Bug 76554 - [gm45] [drm:init_ring_common]: *ERROR* render ring initialization failed
Summary: [gm45] [drm:init_ring_common]: *ERROR* render ring initialization failed
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: Other All
Importance: medium normal
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
Duplicates: 77977 86067
Reported: 2014-03-24 13:39 UTC by Jiri Kosina
Modified: 2017-07-24 22:55 UTC (History)
CC List: 11 users


Attachments
Move all ring resets before setting the HWS patch (3.53 KB, patch)
2014-03-24 15:12 UTC, Chris Wilson
Xorg.0.log from the broken resume (with Chris' patch applied on top of drm-intel-testing). (54.35 KB, text/plain)
2014-03-24 16:56 UTC, Jiri Kosina
Explicitly stop the rings before resetting (3.34 KB, patch)
2014-03-24 17:38 UTC, Chris Wilson
dmesg with drm.debug=7 (280.20 KB, application/octet-stream)
2014-03-24 17:48 UTC, Jiri Kosina
Mark device as wedged if we fail to resume (1023 bytes, patch)
2014-03-24 17:58 UTC, Chris Wilson
dmesg-2 (178.25 KB, application/octet-stream)
2014-03-24 18:48 UTC, Jiri Kosina
Report EIO after resume failure in execbuffer (1.73 KB, patch)
2014-03-24 21:17 UTC, Chris Wilson
Preserve ring buffers across resume (8.71 KB, patch)
2014-03-26 11:47 UTC, Chris Wilson
drm.debug=7 dmesg with patched kernel (138.96 KB, text/plain)
2014-04-02 11:46 UTC, Jiri Kosina
drm.debug=7 dmesg with patched kernel (cfa8aaa3) (260.45 KB, text/plain)
2014-04-02 13:43 UTC, Jiri Kosina
Print ring registers for debugging (1.41 KB, patch)
2014-04-07 09:25 UTC, Chris Wilson
dmesg with ring contets dump before/after initialization (18.60 KB, application/octet-stream)
2014-04-07 11:33 UTC, Jiri Kosina
Poke the ring to see if it is awake (933 bytes, patch)
2014-04-07 11:45 UTC, Chris Wilson
dmesg with ring contents dump and MI_NOOP writes issued (18.97 KB, application/octet-stream)
2014-04-07 11:57 UTC, Jiri Kosina
dmesg with all the patches up to now applied (70.88 KB, text/plain)
2014-04-22 11:23 UTC, Jiri Kosina
dmesg with HEAD==218bb0e7f (70.94 KB, text/plain)
2014-04-22 12:32 UTC, Jiri Kosina
dmesg with fixed start/head (80.96 KB, text/plain)
2014-04-22 13:55 UTC, Jiri Kosina
Retry ring initialisation (1.60 KB, patch)
2014-04-22 15:47 UTC, Chris Wilson
dmesg with retry-patch applied (74.34 KB, text/plain)
2014-04-22 20:49 UTC, Jiri Kosina
Prevent updating the HWS whilst it is active (1.26 KB, patch)
2014-05-16 20:01 UTC, Chris Wilson
frob ring_stop a bit (999 bytes, patch)
2014-08-06 21:43 UTC, Daniel Vetter
frob + debug ring head (2.53 KB, patch)
2014-08-07 08:54 UTC, Daniel Vetter
dmesg with patch from comment #80 (79.86 KB, text/plain)
2014-08-07 09:31 UTC, Jiri Kosina
dmesg of 3.16 + patch from comment #80 (74.05 KB, text/plain)
2014-08-07 09:51 UTC, Jiri Kosina
[PATCH] drm/i915: read HEAD register back in init_ring_common() to enforce ordering (1.19 KB, patch)
2014-08-07 12:22 UTC, Jiri Kosina
dmesg with the frob+debug patch from comment #80 showing the issue (101.18 KB, patch)
2014-08-07 13:46 UTC, Jiri Kosina
head start before enabling (1.12 KB, patch)
2014-08-07 14:04 UTC, Daniel Vetter
dmesg 3.16.2 (151.76 KB, text/plain)
2014-09-18 16:33 UTC, Martin Bednar
Extended-Revert-drm-i915-Move-all-ring-resets-before-setting-the-HWS-page.patch (5.87 KB, text/plain)
2014-10-12 01:44 UTC, Manuel Krause

Description Jiri Kosina 2014-03-24 13:39:25 UTC
This is not reliably reproducible, so a proper bisect is not possible.

Some suspend-resume cycles end up with Xorg misbehaving (graphics not redrawing properly, etc.), and dmesg contains:

[drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head ffffff8804 tail 00000000 start 000e4000

This happens with current Linus' tree (HEAD 774868c70), and the issue is still present after merging drm-intel-testing (HEAD 14347c2) into it.

There is an ongoing mailing-list discussion here:

    https://lkml.org/lkml/2014/2/27/183
Comment 1 Chris Wilson 2014-03-24 15:06:24 UTC
OpenGL should be dead after resume, but the DDX should still behave -- everything should be accessible following a failed resume. Can you please attach your Xorg.0.log after such a failed resume?
Comment 2 Chris Wilson 2014-03-24 15:12:19 UTC
Created attachment 96294 [details] [review]
Move all ring resets before setting the HWS patch

Out of curiosity, can you try?
Comment 3 Jiri Kosina 2014-03-24 16:55:20 UTC
(In reply to comment #2)
> Created attachment 96294 [details] [review] [review]
> Move all ring resets before setting the HWS patch
> 
> Out of curiousity, can you try?

This actually seems to make things substantially worse -- out of two suspend-resume cycles with the kernel that had this patch applied (on top of drm-intel-testing), in both cases the issue triggered.
Comment 4 Jiri Kosina 2014-03-24 16:56:22 UTC
Created attachment 96298 [details]
Xorg.0.log from the broken resume (with Chris' patch applied on top of drm-intel-testing).
Comment 5 Chris Wilson 2014-03-24 17:15:57 UTC
Hmm, a piece of UXA state became corrupt (likely an invalid fb object or something). How does SNA fare? In particular, we can then run the DDX with --enable-debug=full to see what goes wrong. Or we might be able to spot it from a drm.debug=7 dmesg.
Comment 6 Chris Wilson 2014-03-24 17:16:38 UTC
As for the kernel patch, that's weird... Presumably it is then the order in which the ring registers are written.
Comment 7 Chris Wilson 2014-03-24 17:38:11 UTC
Created attachment 96304 [details] [review]
Explicitly stop the rings before resetting

One last idea to try on top of the previous patch is to wait for ring-idle first.
Comment 8 Jiri Kosina 2014-03-24 17:48:17 UTC
Created attachment 96305 [details]
dmesg with drm.debug=7

Attached is a dmesg with drm.debug=7 from the resume that had the problem (had to gzip it due to size).

I had to increase ringbuffer size due to the flood of

  WARNING: CPU: 1 PID: 111 at drivers/gpu/drm/drm_modes.c:119 drm_mode_probed_add+0x51/0x60 [drm]()

which are new since I merged drm-intel-testing -- those are not there with Linus' tree, but the ringbuffer issue still happens. I will report the WARNs separately later.
Comment 9 Jiri Kosina 2014-03-24 17:49:33 UTC
The dmesg from comment #8 is from a kernel that didn't yet have the patch from comment #7 applied. I will be testing that ASAP, thanks.
Comment 10 Chris Wilson 2014-03-24 17:50:52 UTC
[   45.141243] [drm:drm_ioctl], pid=1519, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2
[   45.141256] [drm:i915_gem_do_execbuffer], execbuf with invalid ring: 0
[   45.141260] [drm:drm_ioctl], ret = -22

Wow.
Comment 11 Chris Wilson 2014-03-24 17:58:32 UTC
Created attachment 96307 [details] [review]
Mark device as wedged if we fail to resume

This should help UXA to render correctly following the resume failure.
Comment 12 Jiri Kosina 2014-03-24 18:48:34 UTC
Created attachment 96311 [details]
dmesg-2

Unfortunately the patch from comment #11 didn't help either. Attaching dmesg of the failure with the patch applied.
Comment 13 Chris Wilson 2014-03-24 18:58:08 UTC
Hmm, UXA is being aggressively dumb. It even gets told the GPU is wedged, but ignores it.

The patch did the right thing, but UXA is still not able to notice since it doesn't check for errors when it should.
Comment 14 Chris Wilson 2014-03-24 21:17:54 UTC
Created attachment 96323 [details] [review]
Report EIO after resume failure in execbuffer
Comment 15 Chris Wilson 2014-03-26 11:47:29 UTC
Created attachment 96406 [details] [review]
Preserve ring buffers across resume

Another patch to apply on top of the first 3.
Comment 16 Jiri Kosina 2014-03-26 21:51:15 UTC
(In reply to comment #15)
> Created attachment 96406 [details] [review] [review]
> Preserve ring buffers across resume
> 
> Another patch to apply on top of the first 3.

What tree is this patch against please? I am getting rejects in drivers/gpu/drm/i915/intel_ringbuffer.c both in Linus' tree and in drm-intel-next branch of drm-intel tree.
Comment 17 Chris Wilson 2014-03-31 07:51:37 UTC
I've rebased the patches against drm-intel-nightly so they should apply to most recent kernel trees:

http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug76554
Comment 18 Jiri Kosina 2014-04-01 15:43:49 UTC
Built a kernel pulled from 

  git://people.freedesktop.org/~ickle/linux-2.6 bug76554

with topmost commit being

   commit 1318add417cf6c9dba373393e5b7be62e3283c84
   Author: Chris Wilson <chris@chris-wilson.co.uk>
   Date:   Mon Mar 24 19:17:11 2014 +0000

       drm/i915: Allow the module to load even if we fail to setup rings

but unfortunately the symptoms on resume from hibernation are still exactly the same.
Comment 19 Chris Wilson 2014-04-01 20:19:43 UTC
For reference, can you please attach the drm.debug=7 from the branch across resume? At the least it should have prevented UXA from freezing. :|
Comment 20 Jiri Kosina 2014-04-02 11:46:14 UTC
Created attachment 96776 [details]
drm.debug=7 dmesg with patched kernel

Attached is a drm.debug=7 dmesg demonstrating the problem happening with everything from

    git://people.freedesktop.org/~ickle/linux-2.6 bug76554

(HEAD == 1318add417c) applied.
Comment 21 Chris Wilson 2014-04-02 11:57:04 UTC
Ah, oops, I missed a patch from that branch to prevent the execbuffer from quietly succeeding. That explains why UXA kept on failing, but not why the rings still will not restart.
Comment 22 Chris Wilson 2014-04-02 12:09:45 UTC
One more random rearrangement that should apply on top of that branch:

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 602432eaf346..bbcd6b5446f3 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -456,9 +456,9 @@ static bool stop_ring(struct intel_ring_buffer *ring)
 		}
 	}
 
+	I915_WRITE_CTL(ring, 0);
 	I915_WRITE_HEAD(ring, 0);
 	ring->write_tail(ring, 0);
-	I915_WRITE_CTL(ring, 0);
 
 	if (!IS_GEN2(ring->dev)) {
 		(void)I915_READ_CTL(ring);
@@ -513,18 +513,19 @@ static int init_ring_common(struct intel_ring_buffer *ring)
 	I915_WRITE_CTL(ring,
 			((ring->size - PAGE_SIZE) & RING_NR_PAGES)
 			| RING_VALID);
+	I915_WRITE_HEAD(ring, 0);
+	ring->write_tail(ring, 0);
 
 	/* If the head is still not zero, the ring is dead */
 	if (wait_for((I915_READ_CTL(ring) & RING_VALID) != 0 &&
 		     I915_READ_START(ring) == i915_gem_obj_ggtt_offset(obj) &&
 		     (I915_READ_HEAD(ring) & HEAD_ADDR) == 0, 50)) {
 		DRM_ERROR("%s initialization failed "
-				"ctl %08x head %08x tail %08x start %08x\n",
-				ring->name,
-				I915_READ_CTL(ring),
-				I915_READ_HEAD(ring),
-				I915_READ_TAIL(ring),
-				I915_READ_START(ring));
+			  "ctl %08x (valid? %d) head %08x tail %08x start %08x [expected %08x]\n",
+			  ring->name,
+			  I915_READ_CTL(ring), I915_READ_CTL(ring) & RING_VALID,
+			  I915_READ_HEAD(ring), I915_READ_TAIL(ring),
+			  I915_READ_START(ring), i915_gem_obj_ggtt_offset(obj));
 		ret = -EIO;
 		goto out;
 	}

You may also want to cherry-pick ec9da60002b2390a3932db36d61d1d4e30c4ee21 from the bug76554 branch to prevent uxa from freezing.
Comment 23 Jiri Kosina 2014-04-02 12:52:34 UTC
I refetched the git branch to manually apply the reordering patch on top of it (bugzilla is damaging it, could you please attach it next time? thanks), but the branch doesn't build any more:

drivers/gpu/drm/i915/intel_ringbuffer.c: In function ‘stop_ring’:
drivers/gpu/drm/i915/intel_ringbuffer.c:444: error: ‘drm_i915_private_t’ undeclared (first use in this function)
drivers/gpu/drm/i915/intel_ringbuffer.c:444: error: (Each undeclared identifier is reported only once
drivers/gpu/drm/i915/intel_ringbuffer.c:444: error: for each function it appears in.)
drivers/gpu/drm/i915/intel_ringbuffer.c:444: error: ‘dev_priv’ undeclared (first use in this function)
make[2]: *** [drivers/gpu/drm/i915/intel_ringbuffer.o] Error 1
make[2]: *** Waiting for unfinished jobs....
drivers/gpu/drm/i915/i915_gem.c: In function ‘i915_gem_stop_ringbuffers’:
drivers/gpu/drm/i915/i915_gem.c:4240: error: ‘drm_i915_private_t’ undeclared (first use in this function)
drivers/gpu/drm/i915/i915_gem.c:4240: error: (Each undeclared identifier is reported only once
drivers/gpu/drm/i915/i915_gem.c:4240: error: for each function it appears in.)
drivers/gpu/drm/i915/i915_gem.c:4240: error: ‘dev_priv’ undeclared (first use in this function)
drivers/gpu/drm/i915/i915_gem.c:4241: warning: ISO C90 forbids mixed declarations and code
drivers/gpu/drm/i915/i915_gem.c:4244: warning: left-hand operand of comma expression has no effect
make[2]: *** [drivers/gpu/drm/i915/i915_gem.o] Error 1
make[1]: *** [drivers/gpu/drm/i915] Error 2
make: *** [drivers/gpu/drm/] Error 2


Topmost commit of the branch is

   commit ec9da60002b2390a3932db36d61d1d4e30c4ee21
   Author: Chris Wilson <chris@chris-wilson.co.uk>
   Date:   Mon Mar 24 17:56:36 2014 +0000
Comment 24 Chris Wilson 2014-04-02 13:16:54 UTC
Bleh, rebase error. All suggested patches are now up on #bug76554.
Comment 25 Chris Wilson 2014-04-02 13:19:21 UTC
#bug76554 head is currently

commit cfa8aaa35f180268c99e72964228c944930af680
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Apr 2 13:37:24 2014 +0100
Comment 26 Jiri Kosina 2014-04-02 13:43:29 UTC
Created attachment 96783 [details]
drm.debug=7 dmesg with patched kernel (cfa8aaa3)

With the branch that has cfa8aaa3 as the topmost commit, the ring initialization failures are still popping up on resume, but the Xorg rendering turning into a complete mess is finally solved, and the Xorg session is not corrupted and works! (Although it feels like the whole thing is slower, that might be due to the excessive logging going on.)

dmesg with drm.debug=7 attached.

So if you are going to push anything of this upstream, please feel free to add my

   Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz>

to it, although I assume the ring initialization failure still needs to be solved ... ?

Thanks!
Comment 27 Chris Wilson 2014-04-02 13:47:59 UTC
(In reply to comment #26)
> Created attachment 96783 [details]
> drm.debug=7 dmesg with patched kernel (cfa8aaa3)
> 
> With the branch that has cfa8aaa3 as a topmost commit, the ring
> initialization failures are still popping up on resume, but Xorg rendering
> turning into complete mess is finally solved, and the Xorg session is not
> corrupted and works! (althrough it feels like the whole things is slower,
> but that might be due to excessive logging going on).

Indeed. What happens is that UXA now finally detects that the kernel is reporting that it cannot execute GPU commands, and instead it falls back to CPU rendering directly into the framebuffer.

> dmesg with drm.debug=7 attached.
> 
> So if you are going to push anything of this upstream, please feel free to
> add my
> 
>    Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz>
> 
> to it, although I assume the ring initialization failure still needs to be
> solved ... ?

Yes. We never knew why g45 failed in the first place, if we can figure out what changed now, we may be able to create a better band-aid.
Comment 28 Jiri Kosina 2014-04-02 13:58:29 UTC
(In reply to comment #27)
> Indeed. What happens is that UXA now finally detects that the kernel is
> reporting that it cannot execute GPU commands, and instead it falls back to
> CPU rendering directly into the framebuffer.

Understood, thanks. So the kernel should probably put a huge warning into dmesg once such a condition is detected and the workaround applied.

> > to it, although I assume the ring initialization failure still needs to be
> > solved ... ?
> 
> Yes. We never knew why g45 failed in the first place, if we can figure out
> what changed now, we may be able to create a better band-aid.

Excellent, thanks. Happy to test any diag patches necessary.
Comment 29 Jiri Kosina 2014-04-03 14:33:51 UTC
BTW, may I kindly ask you what your plans with those patches are?

Although it's clear that root-causing the ring initialization failures is still the priority, without this kind of band-aid, Linus' tree is almost completely useless on my system.

Thanks.
Comment 30 Chris Wilson 2014-04-03 15:33:00 UTC
The temporary fix is on its way upstream (under review atm), as keeping the system limping along is essential.
Comment 31 Daniel Vetter 2014-04-05 10:56:47 UTC
Now that the proper fallback handling is on track, have we attempted to bisect where the underlying root-cause (ring init failure on resume) was made much worse? I guess on some older kernels this worked better.

No guarantee that it'll help since this gm45 ring init issue is really elusive, but it might shed some light on what's going on.
Comment 32 Jiri Kosina 2014-04-07 07:34:29 UTC
(In reply to comment #31)
> Now that the proper fallback handling is on track, have we attempted to
> bisect where the underlying root-cause (ring init failure on resume) was
> made much worse? I guess on some older kernels this worked better.

I am afraid this is close to impossible.

The frequency of the problem fluctuates *a lot* between different kernels.

- I am pretty sure that I've *never ever* seen it happen on a 3.7 kernel, and it has been exercised a lot on the system in question

- Around 3.13, this seems to happen only from time to time (say once in 40 resumes, but with a rather large standard deviation)

- with current Linus' tree and with the drm tree as well, this happens super-reliably on almost every resume from hibernation

I don't have enough data from the kernels in between to be able to claim the ratio reliably.

I am afraid this pretty much implies that bisecting this reliably would consume an incredible amount of time and might still produce an unreliable result.
Comment 33 Chris Wilson 2014-04-07 09:25:28 UTC
Created attachment 97027 [details] [review]
Print ring registers for debugging

I think this might help in working out what the values in the registers mean. I think it is sticking to the old value, but I am not sure, hence the patch.
Comment 34 Jiri Kosina 2014-04-07 11:33:46 UTC
Created attachment 97034 [details]
dmesg with ring contets dump before/after initialization

This is a dmesg from resume where ring initialization fails with all the patches (including the before/after ring contents dump) posted here so far applied.
Comment 35 Chris Wilson 2014-04-07 11:41:31 UTC
That's scary. The immediate read of RING_HEAD after it returned 0 during the first initialisation returns a non-zero value... It only just barely passed the self-checks during module load. Just as importantly, it did not have the pattern I was expecting.

I think we should try emitting a dummy command and seeing if the CS ring updates.
Comment 36 Chris Wilson 2014-04-07 11:45:19 UTC
Created attachment 97035 [details] [review]
Poke the ring to see if it is awake

Maybe this is enough to see if the ring responds correctly. Please keep the ring debug patch in place.
Comment 37 Jiri Kosina 2014-04-07 11:57:01 UTC
Created attachment 97036 [details]
dmesg with ring contents dump and MI_NOOP writes issued

Unfortunately the error is still there even with the MI_NOOP writes. dmesg with that (and all the previous patches) applied is attached.
Comment 38 Jiri Kosina 2014-04-14 13:53:23 UTC
So, is there anything else I should try, given that bisecting is not really a viable option here, please?
It's a rather annoying bug and it's my intention to help as much as possible to have it sorted out.
Comment 39 Chris Wilson 2014-04-21 07:37:22 UTC
Hmm. I missed that the "after initialisation" printk is correct. So perhaps all we need is to wait a little longer...


diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 2eb85cc2062f..5a74986348c6 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -537,7 +537,7 @@ static int init_ring_common(struct intel_ring_buffer *ring)
        /* If the head is still not zero, the ring is dead */
        if (wait_for((I915_READ_CTL(ring) & RING_VALID) != 0 &&
                     I915_READ_START(ring) == i915_gem_obj_ggtt_offset(obj) &&
-                    (I915_READ_HEAD(ring) & HEAD_ADDR) == 8, 50)) {
+                    (I915_READ_HEAD(ring) & HEAD_ADDR) == 8, 1000)) {
                DRM_ERROR("%s initialization failed "
                          "ctl %08x (valid? %d) head %08x tail %08x start %08x [expected %08lx]\n",
                          ring->name,
Comment 40 Jiri Kosina 2014-04-22 08:29:41 UTC
(In reply to comment #39)
> Hmm. I missed that the "after initialisation" printk is correct. So perhaps
> all we need is to wait a little longer...

Unfortunately the symptoms are still the same even with timeout == 1000:

[   54.108192] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 (valid? 1) head 000e299c tail 00000008 start 000e4000 [expected 000e4000]
[   54.108201] Ring render ring after initialisation: 0001f001 000e299c 00000008 000e4000
Comment 41 Chris Wilson 2014-04-22 09:35:02 UTC
One last paste... (Apologies for any white space issues, this is just trying to be quick and dirty.)


diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 5a74986348c6..75365c1588fb 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -526,9 +526,28 @@ static int init_ring_common(struct intel_ring_buffer *ring)
         * also enforces ordering), otherwise the hw might lose the new ring
         * register values. */
        I915_WRITE_START(ring, i915_gem_obj_ggtt_offset(obj));
+       if (wait_for(I915_READ_START(ring) == i915_gem_obj_ggtt_offset(obj),
+                    1000)) {
+               DRM_ERROR("%s initialization failed "
+                         "start %08x [expected %08lx]\n",
+                         ring->name,
+                         I915_READ_START(ring),
+                         (unsigned long)i915_gem_obj_ggtt_offset(obj));
+               ret = -EIO;
+               goto out;
+       }
+
        I915_WRITE_CTL(ring,
                        ((ring->size - PAGE_SIZE) & RING_NR_PAGES)
                        | RING_VALID);
+       if (wait_for(I915_READ_CTL(ring) & RING_VALID, 1000)) {
+               DRM_ERROR("%s initialization failed ctl %08x (valid? %d)\n",
+                         ring->name,
+                         I915_READ_CTL(ring),
+                         !!(I915_READ_CTL(ring) & RING_VALID));
+               ret = -EIO;
+               goto out;
+       }
Comment 42 Jiri Kosina 2014-04-22 11:23:05 UTC
Created attachment 97739 [details]
dmesg with all the patches up to now applied

Attaching dmesg with all patches (up to and including the one in comment #41) included with the error condition triggering.
Comment 43 Chris Wilson 2014-04-22 11:35:31 UTC
If it keeps resetting HEAD to a random value after switching the ring on, how does it ever work? :|

Another hack:

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 5a74986348c6..e47324aa8963 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -530,6 +530,13 @@ static int init_ring_common(struct intel_ring_buffer *ring)
                        ((ring->size - PAGE_SIZE) & RING_NR_PAGES)
                        | RING_VALID);
 
+       if (I915_READ_START(ring) != i915_gem_obj_ggtt_offset(obj)) {
+               printk(KERN_ERR "%s initialization failed [%08x != %08x], fudging\n",
+                      ring->name, I915_READ_START(ring), i915_gem_obj_ggtt_offset(obj));
+               I915_WRITE_START(ring, i915_gem_obj_ggtt_offset(obj));
+               POSTING_READ(ring);
+       }
+
        iowrite32(MI_NOOP, ring->virtual_start + 0);
        iowrite32(MI_NOOP, ring->virtual_start + 4);
        ring->write_tail(ring, 8);
Comment 44 Chris Wilson 2014-04-22 11:36:11 UTC
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 5a74986348c6..b46b3e928a7f 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -530,6 +530,17 @@ static int init_ring_common(struct intel_ring_buffer *ring)
                        ((ring->size - PAGE_SIZE) & RING_NR_PAGES)
                        | RING_VALID);
 
+       if (I915_READ_START(ring) != i915_gem_obj_ggtt_offset(obj)) {
+               printk(KERN_ERR
+                      "%s initialization failed"
+                      " [%08x != %08x], fudging\n",
+                      ring->name,
+                      I915_READ_START(ring),
+                      i915_gem_obj_ggtt_offset(obj));
+               I915_WRITE_START(ring, i915_gem_obj_ggtt_offset(obj));
+               POSTING_READ(ring);
+       }
+
        iowrite32(MI_NOOP, ring->virtual_start + 0);
        iowrite32(MI_NOOP, ring->virtual_start + 4);
        ring->write_tail(ring, 8);
Comment 45 Jiri Kosina 2014-04-22 11:53:34 UTC
(In reply to comment #44)
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c
> b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index 5a74986348c6..b46b3e928a7f 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -530,6 +530,17 @@ static int init_ring_common(struct intel_ring_buffer
> *ring)
>                         ((ring->size - PAGE_SIZE) & RING_NR_PAGES)
>                         | RING_VALID);
>  
> +       if (I915_READ_START(ring) != i915_gem_obj_ggtt_offset(obj)) {
> +               printk(KERN_ERR
> +                      "%s initialization failed"
> +                      " [%08x != %08x], fudging\n",
> +                      ring->name,
> +                      I915_READ_START(ring),
> +                      i915_gem_obj_ggtt_offset(obj));
> +               I915_WRITE_START(ring, i915_gem_obj_ggtt_offset(obj));
> +               POSTING_READ(ring);
> +       }
> +
>         iowrite32(MI_NOOP, ring->virtual_start + 0);
>         iowrite32(MI_NOOP, ring->virtual_start + 4);
>         ring->write_tail(ring, 8);

What baseline should I apply this on top of, please? The surrounding code in my tree (with all the patches provided so far applied) is

[ ... ]
        I915_WRITE_CTL(ring,
                        ((ring->size - PAGE_SIZE) & RING_NR_PAGES)
                        | RING_VALID);
        if (wait_for(I915_READ_CTL(ring) & RING_VALID, 1000)) {
                DRM_ERROR("%s initialization failed ctl %08x (valid? %d)\n",
                                ring->name,
                                I915_READ_CTL(ring),
                                !!(I915_READ_CTL(ring) & RING_VALID));
                ret = -EIO;
                goto out;
        }
        I915_WRITE_HEAD(ring, 0);
        ring->write_tail(ring, 0);

        iowrite32(MI_NOOP, ring->virtual_start + 0);
        iowrite32(MI_NOOP, ring->virtual_start + 4);
        ring->write_tail(ring, 8);
[ ... ]

(i.e. it has the extra I915_WRITE_HEAD(ring, 0); ring->write_tail(ring, 0);, etc).

I can of course easily apply the hunk just between the 

   ring->write_tail(ring, 0);

and 

   iowrite32(MI_NOOP, ring->virtual_start + 0);

if that's what you want me to do.

Thanks.
Comment 46 Chris Wilson 2014-04-22 12:12:17 UTC
(In reply to comment #45)
> (In reply to comment #44)
> > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > index 5a74986348c6..b46b3e928a7f 100644
> > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> > @@ -530,6 +530,17 @@ static int init_ring_common(struct intel_ring_buffer
> > *ring)
> >                         ((ring->size - PAGE_SIZE) & RING_NR_PAGES)
> >                         | RING_VALID);
> >  
> > +       if (I915_READ_START(ring) != i915_gem_obj_ggtt_offset(obj)) {
> > +               printk(KERN_ERR
> > +                      "%s initialization failed"
> > +                      " [%08x != %08x], fudging\n",
> > +                      ring->name,
> > +                      I915_READ_START(ring),
> > +                      i915_gem_obj_ggtt_offset(obj));
> > +               I915_WRITE_START(ring, i915_gem_obj_ggtt_offset(obj));
> > +               POSTING_READ(ring);
> > +       }
> > +
> >         iowrite32(MI_NOOP, ring->virtual_start + 0);
> >         iowrite32(MI_NOOP, ring->virtual_start + 4);
> >         ring->write_tail(ring, 8);
> 
> What is a baseline I should apply this on top of, please? The surrounding
> code in my tree (with all the patches provided so far apples) is
> 
> [ ... ]
>         I915_WRITE_CTL(ring,
>                         ((ring->size - PAGE_SIZE) & RING_NR_PAGES)
>                         | RING_VALID);
>         if (wait_for(I915_READ_CTL(ring) & RING_VALID, 1000)) {
>                 DRM_ERROR("%s initialization failed ctl %08x (valid? %d)\n",
>                                 ring->name,
>                                 I915_READ_CTL(ring),
>                                 !!(I915_READ_CTL(ring) & RING_VALID));
>                 ret = -EIO;
>                 goto out;
>         }
>         I915_WRITE_HEAD(ring, 0);
>         ring->write_tail(ring, 0);
> 
>         iowrite32(MI_NOOP, ring->virtual_start + 0);
>         iowrite32(MI_NOOP, ring->virtual_start + 4);
>         ring->write_tail(ring, 8);
> [ ... ]
> 
> (i.e. it has the extra I915_WRITE_HEAD(ring, 0); ring->write_tail(ring, 0);,
> etc).
> 
> I can of course easily apply the hunk just between the 
> 
>    ring->write_tail(ring, 0);
> 
> and 
> 
>    iowrite32(MI_NOOP, ring->virtual_start + 0);
> 
> if that's what you want me to do.
> 
> Thanks.

Sorry, I threw away the preceding hack to try and keep the diff clean. Just plonk the write to set HEAD again after setting CTRL (and the wait_for(CTRL) if you have that).

Hmm, it appears we have drifted slightly in our assortment of patches, let me push my current collection of hacks so we can rebase.
Comment 47 Chris Wilson 2014-04-22 12:16:21 UTC
Latest set of hacks and patches on top of drm-intel-nightly: http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=bug76554
Comment 48 Jiri Kosina 2014-04-22 12:32:25 UTC
Created attachment 97744 [details]
dmesg with HEAD==218bb0e7f

The problem is still there with the referenced branch (SHA1 HEAD 218bb0e7f). Dmesg attached.
Comment 49 Chris Wilson 2014-04-22 12:38:15 UTC
So it passes the immediate check that HEAD is valid after setting CTRL, but then fails shortly afterwards. Humph. I am not sure what is going on!

I wonder if it is as simple as the combination of reads failing?
Comment 50 Jiri Kosina 2014-04-22 12:49:43 UTC
The problematic condition causing the whole ring to be claimed dead is

     (I915_READ_HEAD(ring) & HEAD_ADDR) == 8

right?

I915_READ_HEAD(ring) returns 000e200c and HEAD_ADDR is 0x001FFFFC, so the result is e200c, not the expected value of 8, causing the ring initialization failure.

Or am I completely wrong here?
Comment 51 Jiri Kosina 2014-04-22 13:02:07 UTC
(In reply to comment #50)

I probably wasn't super clear what I was referring to by this comment:

> The problematic condition causing the whole ring to be claimed dead is
> 
>      I915_READ_HEAD(ring) & HEAD_ADDR) == 8
> 
> right?
> 
> I915_READ_HEAD(ring) returns 000e200c HEAD_ADDR is 0x001FFFFC, so the result
> is e200c, not the expected value of 8, causing the ring initialization
> failure.

I was referring to this:


(In reply to comment #49)
> So it passes the immediate check that HEAD is valid after setting CTRL, but
> then fails shortly afterwards. Humph. I am not sure what is going on!

because I don't see any check for HEAD validity after setting CTRL; I only see

     I915_READ_START(ring) != i915_gem_obj_ggtt_offset(obj)

check, but no I915_READ_HEAD() check ... but obviously, I am absolutely unfamiliar with this code, so sorry if this is just unnecessary noise.
> 
> I wonder if it is as simple as the combination of reads failing?
Comment 52 Chris Wilson 2014-04-22 13:28:01 UTC
No, it is just me getting confused between HEAD and START. Ok, I wonder if this is the missing piece of magic (on top of the current bug branch):

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index b46b3e928a7f..12c59e945f8e 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -530,15 +530,14 @@ static int init_ring_common(struct intel_ring_buffer *ring)
                        ((ring->size - PAGE_SIZE) & RING_NR_PAGES)
                        | RING_VALID);
 
-       if (I915_READ_START(ring) != i915_gem_obj_ggtt_offset(obj)) {
+       if (I915_READ_HEAD(ring)) {
                printk(KERN_ERR
                       "%s initialization failed"
-                      " [%08x != %08x], fudging\n",
+                      " [head now %08x], fudging\n",
                       ring->name,
-                      I915_READ_START(ring),
-                      i915_gem_obj_ggtt_offset(obj));
-               I915_WRITE_START(ring, i915_gem_obj_ggtt_offset(obj));
-               POSTING_READ(ring);
+                      I915_READ_HEAD(ring));
+               I915_WRITE_HEAD(ring, 0);
+               (void)I915_READ_HEAD(ring);
        }
 
        iowrite32(MI_NOOP, ring->virtual_start + 0);
Comment 53 Jiri Kosina 2014-04-22 13:55:57 UTC
Created attachment 97747 [details]
dmesg with fixed start/head

On the first resume, the issue didn't occur, but the second suspend-resume cycle revealed it again. dmesg attached.
Comment 54 Chris Wilson 2014-04-22 15:43:29 UTC
After the first resume, we applied the fixup. After the second resume, it managed to get past the check and then failed. /o\
Comment 55 Chris Wilson 2014-04-22 15:47:07 UTC
Created attachment 97756 [details] [review]
Retry ring initialisation

And another hack!
Comment 56 Jiri Kosina 2014-04-22 20:49:34 UTC
Created attachment 97774 [details]
dmesg with retry-patch applied

dmesg with patch from comment#55 applied on top of the previous pile.

The only notable difference seems to be the appearance of

  [drm:stop_ring] *ERROR* render ring :timed out trying to stop ring

during resume.
Comment 57 Jiri Kosina 2014-05-06 23:21:13 UTC
Is there anything new on this front, please?
Comment 58 Chris Wilson 2014-05-08 10:22:35 UTC
I haven't had any other inspiration. Maybe,


diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 401f3e7..ccb0e5c 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -513,6 +513,7 @@ reset:
         * registers with the above sequence (the readback of the HEAD registers
         * also enforces ordering), otherwise the hw might lose the new ring
         * register values. */
+       memset(ring->virtual_start, 0, ring->size);
        I915_WRITE_START(ring, i915_gem_obj_ggtt_offset(obj));
        I915_WRITE_CTL(ring,
                        ((ring->size - PAGE_SIZE) & RING_NR_PAGES)
Comment 59 Jiri Kosina 2014-05-12 08:34:01 UTC
With that patch in place (on top of all previous patches), this is still in dmesg upon resume:

[   30.584016] [drm:stop_ring] *ERROR* render ring :timed out trying to stop ring
[   30.584021] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f401 (valid? 1) head 000e202c tail 00000008 start 000e4000 [expected 000e4000]
[   30.584024] Ring render ring after initialisation: 0001f401 000e202c 00000008 000e4000
[   30.584034] [drm:__i915_drm_thaw] *ERROR* failed to re-initialize GPU, declaring wedged!
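
To make the register dump above easier to read, here is a small user-space sketch that decodes the four values. It is not driver code: the RING_VALID and RING_NR_PAGES values are assumptions about the i915 register layout, HEAD_ADDR is the mask quoted in comment 50, and struct ring_regs and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define RING_VALID    0x00000001u  /* assumed: "ring enabled" bit in CTL */
#define RING_NR_PAGES 0x001FF000u  /* assumed: ring length field in CTL */
#define HEAD_ADDR     0x001FFFFCu  /* mask quoted in comment 50 */
#define PAGE_SIZE_4K  0x1000u      /* assumed 4 KiB pages */

/* Hypothetical container for the four values printed in the error line. */
struct ring_regs {
    uint32_t ctl;
    uint32_t head;
    uint32_t tail;
    uint32_t start;
};

/* Non-zero if the valid bit is set in CTL. */
static inline int ring_is_valid(const struct ring_regs *r)
{
    return (r->ctl & RING_VALID) != 0;
}

/* The diffs in this thread program CTL as
 * ((size - PAGE_SIZE) & RING_NR_PAGES) | RING_VALID,
 * so the size is recovered by inverting that expression. */
static inline uint32_t ring_size_bytes(const struct ring_regs *r)
{
    return (r->ctl & RING_NR_PAGES) + PAGE_SIZE_4K;
}

/* Masked head value, as compared by the init check. */
static inline uint32_t ring_head_offset(const struct ring_regs *r)
{
    return r->head & HEAD_ADDR;
}
```

Under those assumptions, the line above (ctl 0001f401, head 000e202c, tail 00000008) decodes to a ring marked valid, 0x20000 bytes (128 KiB) long, with a masked head of 0xe202c while the tail is 8.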
Comment 60 Mika Kuoppala 2014-05-12 15:14:42 UTC
No good ideas here either, but it would be nice to see whether this makes a difference on
ring init:

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 4024e16..708a1da 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -573,6 +573,15 @@ static int i915_drm_thaw_early(struct drm_device *dev)
 static int __i915_drm_thaw(struct drm_device *dev, bool restore_gtt_mappings)
 {
        struct drm_i915_private *dev_priv = dev->dev_private;
+       int ret;
+
+       mutex_lock(&dev->struct_mutex);
+       ret = intel_gpu_reset(dev);
+       mutex_unlock(&dev->struct_mutex);
+
+       if (ret)
+               DRM_ERROR("failed to reset the GPU on resume (%d), ignoring\n",
+                         ret);
Comment 61 Jiri Kosina 2014-05-13 11:56:09 UTC
(In reply to comment #60)
> No good ideas here either, but would be nice to see if this makes a
> difference on
> ring init:

Even with the memset() patch from comment#60 applied on top of the previous bunch, I see this on resume:

[   54.300012] [drm:stop_ring] *ERROR* render ring :timed out trying to stop ring
[   54.300018] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f401 (valid? 1) head 000e252c tail 00000008 start 000e4000 [expected 000e4000]
[   54.300021] Ring render ring after initialisation: 0001f401 000e252c 00000008 000e4000
[   54.300031] [drm:__i915_drm_thaw] *ERROR* failed to re-initialize GPU, declaring wedged!
Comment 62 Jiri Kosina 2014-05-13 11:57:57 UTC
(In reply to comment #61)
> (In reply to comment #60)
> > No good ideas here either, but would be nice to see if this makes a
> > difference on
> > ring init:
> 
> Even with the memset() 

memset() here should actually read intel_gpu_reset(), sorry for the confusion.
Comment 63 Antti Koskipaa 2014-05-13 12:23:15 UTC
Assigning to Chris since he seems to be all over it.
Comment 64 Chris Wilson 2014-05-16 20:01:09 UTC
Created attachment 99172 [details] [review]
Prevent updating the HWS whilst it is active

Stumbled across this. Probably irrelevant, but it is in the right area.
Comment 65 Chris Wilson 2014-05-19 11:45:19 UTC
*** Bug 77977 has been marked as a duplicate of this bug. ***
Comment 66 Jiri Kosina 2014-05-20 14:17:54 UTC
(In reply to comment #64)
> Created attachment 99172 [details] [review]
> Prevent updating the HWS whilst it is active
> 
> Stumbled across this. Probably irrelevant, but it is in the right area.

Unfortunately this patch doesn't improve the behavior.
Comment 67 Martin Fahr 2014-06-11 11:35:13 UTC
A change somewhere in between 3.14 and 3.15 makes me hit this bug *almost* reliably. Bisecting it took me half a day and ended up pointing at commit [78f2975eec9faff353a6194e854d3d39907bab68 drm/i915]: Move all ring resets before setting the HWS page. As the title is the same as a patch posted here earlier, I suppose it is the exact same patch? It seems like what was meant to be a solution to the problem, actually makes it much worse (and maybe helps to find the root cause of it). 

If there's anything else I can do, just let me know.
Comment 68 alium 2014-06-17 13:19:19 UTC
It looks like I have the same problem. After upgrading the kernel to 3.15 / 3.15.1, the following error appears in dmesg after suspend:

[   31.496713] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 000009c0 tail 00000000 start 000fd000

[   31.591596] PM: Device 0000:00:02.0 failed to resume async: error -5


I have G45 - X4500MHD, mesa 10.2.1, xf86-video-intel 2.99.912, libdrm 2.4.54.
Comment 69 Chris Wilson 2014-06-24 11:52:35 UTC
Fwiw, http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug76554 has everything we have tried so far.
Comment 70 Chris Wilson 2014-06-25 06:47:04 UTC
Here's something you can try on top of that branch:

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 9ee4ab306134..4f3397f87152 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -539,13 +539,13 @@ reset:
                        goto reset;
 
                DRM_ERROR("%s initialization failed "
-                         "ctl %08x (valid? %d) head %08x tail %08x start %08x [expected %08lx]\n",
+                         "ctl %08x (valid? %d) head %08x tail %08x start %08x [expected %08lx], fudging\n",
                          ring->name,
                          I915_READ_CTL(ring), I915_READ_CTL(ring) & RING_VALID,
                          I915_READ_HEAD(ring), I915_READ_TAIL(ring),
                          I915_READ_START(ring), (unsigned long)i915_gem_obj_ggtt_offset(obj));
-               ret = -EIO;
-               goto out;
+
+               ring->write_tail(ring, I915_READ_HEAD(ring) & HEAD_ADDR);
        }
 
        if (!drm_core_check_feature(ring->dev, DRIVER_MODESET))


The idea is to ignore the failure and see if we can program the GPU anyway.
Comment 71 Jiri Kosina 2014-06-27 13:58:51 UTC
(In reply to comment #70)
> Here's something you can try on top of that branch:
[ ... snip ... ]
> The idea is to ignore the failure and see if we can program the GPU anyway.

This made things much worse.

X comes back after resume (i.e. the windows get drawn exactly the same way they were laid out during suspend), but afterwards the system is completely dead. Even ctrl-alt-backspace doesn't kill the X session, and it's not possible to switch to a text console.
Comment 72 Simon Kalteis 2014-07-08 13:56:42 UTC
Hi, I am currently on 3.16-rc4 and the GPU gets disabled right on load of the i915 module, no need to suspend/wake :-(

Xorg seems to draw fine - unaccelerated though, and xv is not working either (as one would expect).

The system is a Lenovo T500 with a GM45 chipset. If I can somehow help debug this by providing logs let me know...
Comment 73 Martin Bednar 2014-07-12 19:04:39 UTC
Hi,

I'm hitting this reliably on every resume on 3.15.5. (libdrm 2.4.54, mesa 10.2.3)
relevant dmesg output : 
juil. 11 17:28:23 Nemmerle kernel: [drm:i965_irq_handler] *ERROR* pipe B underrun
juil. 11 17:29:52 Nemmerle kernel: [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00000268 tail 00000000 start 007510
juil. 11 17:29:52 Nemmerle kernel: dpm_run_callback(): pci_pm_resume+0x0/0xb0 returns -5
juil. 11 17:29:52 Nemmerle kernel: PM: Device 0000:00:02.0 failed to resume async: error -5

Kwin crashes really bad after this happens...

Not sure if the pipe B underrun is related, as that happens even without resuming.
This is on : 
00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
If I can help in any way, let me know...
Comment 74 cunio 2014-07-23 21:38:43 UTC
For me the problem occurs when my Lenovo G550 runs KDE4 with desktop effects on and on battery 

00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 09)
Comment 75 alium 2014-08-04 20:04:18 UTC
It seems that the bug is fixed in the kernel 3.16. For me, it works normally again.

Archlinux 
xorg-server 1.16.0-5
xf86-video-intel 2.99.914-1
libdrm 2.4.56-1
mesa 10.2.4-1
linux 3.16-1
Comment 76 Simon Kalteis 2014-08-04 21:36:02 UTC
Still on 3.16 -- as expected. Guess the proposed experiments for this go into the 3.17 merge window?

[    0.536244] [drm] Initialized drm 1.1.0 20060810
[    0.536720] [drm] Memory usable by graphics device = 2048M
[    0.536783] [drm] Replacing VGA console driver
[    0.537337] Console: switching to colour dummy device 80x25
[    0.543156] i915 0000:00:02.0: irq 44 for MSI/MSI-X
[    0.543168] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    0.543175] [drm] Driver supports precise vblank timestamp query.
[    0.543255] vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
[    0.619102] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 (valid? 1) head 00000298 tail 00000000 start 000fd000 [expected 000fd000]
[    0.619114] [drm:i915_gem_init] *ERROR* Failed to initialize GPU, declaring it wedged
Comment 77 Jiri Kosina 2014-08-06 08:15:31 UTC
3.16 still doesn't work for me; it fails in exactly the same way as before.
Comment 78 Daniel Vetter 2014-08-06 21:43:37 UTC
Created attachment 104181 [details] [review]
frob ring_stop a bit

I think this is something we haven't tried yet. dmesg with results highly welcome.
Comment 79 Simon Kalteis 2014-08-07 00:22:55 UTC
Sadly, the last patch does nothing for me.

simon@thinkpad:~$ dmesg | grep drm
[    0.539701] [drm] Initialized drm 1.1.0 20060810
[    0.540192] [drm] Memory usable by graphics device = 2048M
[    0.540256] [drm] Replacing VGA console driver
[    0.547166] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    0.547173] [drm] Driver supports precise vblank timestamp query.
[    0.624101] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 (valid? 1) head 00000204 tail 00000000 start 000fd000 [expected 000fd000]
[    0.624113] [drm:i915_gem_init] *ERROR* Failed to initialize GPU, declaring it wedged
[    0.648508] fbcon: inteldrmfb (fb0) is primary device
[    1.186237] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
[    1.222687] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
Comment 80 Daniel Vetter 2014-08-07 08:54:30 UTC
Created attachment 104208 [details] [review]
frob + debug ring head

Ok, I'm honestly lost what's going on right now. Can you please retest with this patch, which has piles of debug output?
Comment 81 Jiri Kosina 2014-08-07 09:31:02 UTC
Created attachment 104212 [details]
dmesg with patch from comment #80

Here is a dmesg from two suspend-resume cycles with Daniel's patch from comment #80 (applied on top of all previous patches from Chris).

There seems to be a change in behavior!

Interestingly, I am not seeing

    [drm:init_ring_common] *ERROR* render ring initialization failed ....

messages any more, and it seems I am indeed still running accelerated (i.e. the graphics still seem reasonably fast and usable).

For some reason there is still 

    render ring initialization failed [head now 01000000], fudging

message present though.

I will now try the patch from comment #80 on top of something more recent than the 3.15-based branch with all Chris' patches applied, and will let you know.
Comment 82 Jiri Kosina 2014-08-07 09:51:47 UTC
Created attachment 104216 [details]
dmesg of 3.16 + patch from comment #80

Woohooo! Daniel, your patch from comment #80 seems to make the bug go away!

Attached is dmesg from 3.16 + patch from comment #80. No more ring initialization errors in dmesg, everything working properly.

I'll do a couple more suspend/resume cycles to be sure that the problem just doesn't happen less frequently and will report back.

So either it's the wait_for_atomic() -> wait_for() change, the extra wait_for() added in order to wait for the head to be cleared, or the extra I915_READ_HEAD() reads inserted.

Unless you now have a clear idea of what's happening, I'll try to isolate which of the changes in the patch is the one that makes the difference.
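
One of the candidates listed above, waiting for HEAD to read back as cleared, boils down to a bounded polling loop. The sketch below only illustrates that shape in user space against a toy register model. struct fake_head_reg and all function names are hypothetical, the poll limit merely stands in for the timeout of the kernel's wait_for(), HEAD_ADDR is the mask quoted in comment 50, and nothing here claims to reproduce how the real hardware behaves.

```c
#include <assert.h>
#include <stdint.h>

#define HEAD_ADDR 0x001FFFFCu  /* mask quoted in comment 50 */

/* Toy stand-in for a ring HEAD register: after a write it keeps reporting a
 * stale value for `settle_reads` further reads, then reports the written one. */
struct fake_head_reg {
    uint32_t stale;
    uint32_t written;
    unsigned settle_reads;
};

static inline void fake_write_head(struct fake_head_reg *r, uint32_t v)
{
    r->written = v;
}

static inline uint32_t fake_read_head(struct fake_head_reg *r)
{
    if (r->settle_reads > 0) {
        r->settle_reads--;
        return r->stale;
    }
    return r->written;
}

/* Write HEAD = 0, then poll until the masked head reads back as zero.
 * Returns 0 on success, -1 if it did not clear within max_polls reads. */
static inline int reset_head_and_wait(struct fake_head_reg *r, unsigned max_polls)
{
    fake_write_head(r, 0);
    for (unsigned i = 0; i < max_polls; i++) {
        if ((fake_read_head(r) & HEAD_ADDR) == 0)
            return 0;
    }
    return -1;
}
```

In this toy model each poll is also a register read, so it cannot separate the effect of "waiting" from the effect of "reading"; telling those apart is exactly what isolating the individual changes on real hardware would have to do.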
Comment 83 Jiri Kosina 2014-08-07 12:22:29 UTC
Created attachment 104224 [details] [review]
[PATCH] drm/i915: read HEAD register back in init_ring_common() to enforce ordering

Ok, this is the minimal change that reliably makes my system behave properly again finally (woohoo!).

Chris, Daniel, what do you think?
Comment 84 Jiri Kosina 2014-08-07 12:55:21 UTC
Okay, after 31 suspend-resume cycles with the patch from comment #83 applied, the problem appeared again (without the patch, it triggers with 100% reliability).

So it's not a complete fix, it just makes the problem much less likely to happen.
Comment 85 Chris Wilson 2014-08-07 12:59:48 UTC
The intel_ring_setup_status_page() does a posting read anyway, so it is not an ordering issue. So this is back in the magic read territory, can you try with a msleep(10) instead of the read just to confirm that it is the read doing the trick and not an extra delay?
Comment 86 Daniel Vetter 2014-08-07 13:03:53 UTC
It's still very strange that HEAD starts to move once we've initialized the ring. So it seems like we can properly reset it, but then it goes bananas ...

More dmesgs from different machines with that frob+debug patch definitely appreciated.
Comment 87 Jiri Kosina 2014-08-07 13:14:57 UTC
(In reply to comment #85)
> The intel_ring_setup_status_page() does a posting read anyway, so it is not
> an ordering issue. So this is back in the magic read territory, can you try
> with a msleep(10) instead of the read just to confirm that it is the read
> doing the trick and not an extra delay?

With msleep() used instead of the register read, the problem triggers fully reliably again; i.e. the "magic read" really does some trick, although it's not a complete cure.
Comment 88 Jiri Kosina 2014-08-07 13:16:51 UTC
(In reply to comment #86)
> It's still very strange that HEAD starts to move once we've initialized the
> ring. So it seems like we can properly reset it, but then it goes banas ...
> 
> More dmesgs from different machines with that frob+debug patch definitely
> appreciated.

I will provide you with dmesg output from failing resume with your frob+debug patch once the issue triggers with it applied (it hasn't so far).
Comment 89 Jiri Kosina 2014-08-07 13:46:19 UTC
Created attachment 104226 [details] [review]
dmesg with the frob+debug patch from comment #80 showing the issue

Finally after many suspend-resume cycles, the issue triggered also with Daniel's frob+debug patch from comment #80.

Resulting dmesg attached.
Comment 90 Daniel Vetter 2014-08-07 13:59:45 UTC
Jiri, can you please submit your patch from comment #83 to upstream? It's not perfect, but duct tape is good, so I'll merge it as an interim solution.
Comment 91 Jiri Kosina 2014-08-07 14:04:06 UTC
(In reply to comment #90)
> Jiri, can you please submit your patch from commment #83 to upstream? It's
> not perfect, but ducttape is good, so I'll merge it as an interim solution.

I can definitely do that if that's your preferred course of action. 

This will mean that a smaller number of people will be hitting the bug and hence be available to test proper fixes (hopefully the dmesg from comment #89 will provoke some idea?). 

OTOH, I will always be here to test any patches to have a final fix, so if that's enough for you, then fine :)
Comment 92 Daniel Vetter 2014-08-07 14:04:18 UTC
Created attachment 104227 [details] [review]
head start before enabling

Another crazy idea. Looking at logs and Jiri's patch, the critical step seems to be when we set the valid bit. Let's see what happens if we give the ring a headstart, hopefully catching the moving ring.

You can experiment with different values, as long as they're a multiple of 8. 64 might be magic since it's the cacheline size (which in a few w/a is really important for register writes, even though that's strange).
Comment 93 Jiri Kosina 2014-08-07 14:22:35 UTC
(In reply to comment #92)
> Created attachment 104227 [details] [review]
> head start before enabling
> 
> Another crazy idea. Looking at logs and Jiri's patch, the critical step
> seems to be when we set the valid bit. Let's see what happens if we give the
> ring a headstart, hopefully catching the moving ring.
> 
> You can experiment with different values, as long as they're a multiple of
> 8. 64 might be magic since it's the cacheline size (which in a few w/a is
> really important for register writes, even though that's strange).

This patch causes another ring initialization failure, 100% of the time, during boot (i.e. no suspend-resume cycle is even necessary):

[    3.496122] [drm:init_ring_common] *ERROR* bsd ring initialization failed ctl 0001f001 (valid? 1) head 00000008 tail 00000040 start 00107000 [expected 00107000]
[    3.496256] [drm:i915_gem_init] *ERROR* Failed to initialize GPU, declaring it wedged
Comment 94 Chris Wilson 2014-08-07 14:59:50 UTC
One of the earlier patches is now available as a standalone module, http://patchwork.freedesktop.org/patch/31266/ as Ville found a suspiciously similar w/a for g4x.
Comment 95 Simon Kalteis 2014-08-07 16:06:24 UTC
No more failures on boot or resume here so far, using Jiri's one-liner. Will see if this is consistent.

Thanks all! :-)
Comment 96 Jiri Kosina 2014-08-08 13:00:02 UTC
(In reply to comment #95)
> No more failures on boot or resume here so far, using Jiri's one-liner. Will
> see if this is consistent.
> 
> Thanks all! :-)

Thanks for testing. Please bear in mind though that this is a workaround that makes the bug less likely to happen, but it's still possible that it triggers.



(In reply to comment #89)
> Created attachment 104226 [details] [review]
> dmesg with the frob+debug patch from comment #80 showing the issue
> 
> Finally after many suspend-resume cycles, the issue triggered also with
> Daniel's frob+debug patch from comment #80.
> 
> Resulting dmesg attached.

Daniel, did that make any sense whatsoever to you? There is an obvious difference in the 'After init' value, 0x01000000 (working case) vs. 0x000e4004 (broken case), and nothing else stands out to me.
Comment 97 Jiri Kosina 2014-08-13 13:03:17 UTC
I'll have the affected notebook with me next week in Chicago at Kernel Summit, in case it'd help you with debugging ... ?
Comment 98 Diego Viola 2014-08-16 03:41:06 UTC
Hello.

I'm getting this in dmesg:

[   12.413399] [drm:i915_gem_init] *ERROR* Failed to initialize GPU, declaring it wedged

OpenGL is broken in my system, Xv is also broken.

I can't watch any videos in mpv/mplayer/vlc with vo_xv. Also, glxgears returns this:

[diego@myhost school]$ glxgears
Running synchronized to the vertical refresh.  The framerate should be
approximately the same as the monitor refresh rate.
intel_do_flush_locked failed: Invalid argument
[diego@myhost school]$ 


Is this the same bug or should I open another one?
Comment 99 Diego Viola 2014-08-16 03:55:35 UTC
Arch Linux (x86_64) here.
Comment 100 Simone Lazzaris 2014-08-19 13:04:55 UTC
Same here: 
Aug 19 08:59:09 localhost kernel: [    7.606690] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 (valid? 1) head 00009654 tail 00000000 start 000e4000 [expected 000e4000]
Aug 19 08:59:09 localhost kernel: [    7.606711] [drm:i915_gem_init] *ERROR* Failed to initialize GPU, declaring it wedged

Archlinux, AMD64, kernel 3.16.1
Comment 101 Simon Kalteis 2014-08-19 13:07:33 UTC
Jiri Kosina's patch that more or less fixes this (at least for now and on my system...) is already in 3.17-rc1. So you could either patch your current version or upgrade.
:-)
Comment 102 Diego Viola 2014-08-20 11:39:32 UTC
Will Linux 3.16.2 include this fix/workaround?
Comment 103 Jani Nikula 2014-08-20 14:05:23 UTC
Fixed by
commit ece4a17d237a79f63fbfaf3f724a12b6d500555c
Author: Jiri Kosina <jkosina@suse.cz>
Date:   Thu Aug 7 16:29:53 2014 +0200

    drm/i915: read HEAD register back in init_ring_common() to enforce ordering

(In reply to comment #102)
> Will Linux 3.16.2 include this fix/workaround?

It will eventually be backported to supported stable kernels, but when that happens depends on the stable team.
Comment 104 Jiri Kosina 2014-08-20 14:33:29 UTC
I suggest keeping the bug open for quite some time.

We all know (and my stress-testing underlines that) that this is a duct-tape and not a real fix.
I am still planning to spend some more time on this.

If you hate having this bug assigned to Intel as you believe there's not much that you can do (which is my current understanding), please feel free to re-assign it to me.

I'll close it either once I am completely out of crazy ideas, or I find a reliable fix.

Thanks.
Comment 105 Jani Nikula 2014-09-05 12:54:25 UTC
(In reply to comment #104)
> I suggest to keep the bug still open for quite some time.

Ok, dropping regression and reducing priority.
Comment 106 Chris Wilson 2014-09-05 15:40:48 UTC
commit 95468892fdfeef6d1004b524e35957629efdbe00
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Aug 7 15:39:54 2014 +0100

    drm/i915: Reset the HEAD pointer for the ring after writing START
    
    Ville found an old w/a documented for g4x that suggested that we need to
    reset the HEAD after writing START. This is a useful fixup for some of
    the g4x ring initialisation woes, but as usual, not all.
Comment 107 Jan Niggemann 2014-09-18 13:28:16 UTC
I applied Jiri's patch (attachment 104224) against a vanilla 3.16.3 and I can confirm that it mitigates the issue. I see no more "render ring initialization failed" errors, but there seem to be side effects, at least on my Thinkpad T400 (GM45, Debian stable).

I noticed that mouse movement becomes sluggish (with no applications open; I didn't find the reason) and, as a showstopper, my whole system froze. I couldn't get out of X with ctrl-alt-backspace, and when I tried to switch to tty1 I briefly saw a stack dump before the screen went entirely black (not off).

Would you like a dmesg or a drm debug log?
If there are any other patches available, please tell me.
Comment 108 Martin Bednar 2014-09-18 16:33:06 UTC
Created attachment 106514 [details]
dmesg 3.16.2

Another dmesg. Kernel 3.16.2 (chakra, default).
Comment 109 Manuel Krause 2014-10-12 01:30:00 UTC
I may have been hit by this issue since the 3.15/3.16 kernels, i.e. since:

---
From 78f2975eec9faff353a6194e854d3d39907bab68 Mon Sep 17 00:00:00 2001
From: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed, 2 Apr 2014 16:36:07 +0100
Subject: drm/i915: Move all ring resets before setting the HWS page

In commit a51435a3137ad8ae75c288c39bd2d8b2696bae8f
Author: Naresh Kumar Kachhi <naresh.kumar.kachhi@intel.com>
Date:   Wed Mar 12 16:39:40 2014 +0530

    drm/i915: disable rings before HW status page setup

we reordered stopping the rings to do so before we set the HWS register.
However, there is an extra workaround for g45 to reset the rings twice,
and for consistency we should apply that workaround before setting the
HWS to be sure that the rings are truly stopped.
---

I continuously got things like the following after SUSPEND-TO-DISK:
[53556.636015] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 (valid? 1) head ffff8814 tail 00000000 start 0013c000 [expected 0013c000]
[53556.636018] [drm:__i915_drm_thaw] *ERROR* failed to re-initialize GPU, declaring wedged!

After which all xv video playing fails (blank==black smplayer/vlc content window). This one was with 3.16.5, which already includes Jiri's patch (in kernel since vanilla 3.16.4).
---

The only way to fix this issue for me, ever since I was first affected at 3.15.7, has been to revert commit 78f2975eec9faff353a6194e854d3d39907bab68 (see: https://bitbucket.org/alfredchen/linux-gc/commits/b462d396377b2765d1ed3b416bcf90dc02a85293/raw/) 
and then to keep following its child commits as the kernel evolved, reverting them too. That has worked until now.

I'll attach this _working_ patch for 3.16.5 in the following comment,
in the hope of bringing some inspiration to you all, although it is already some kind of history?!

If you need more info, please just ask.

Best regards,
Manuel Krause
Comment 110 Manuel Krause 2014-10-12 01:44:42 UTC
Created attachment 107730 [details]
Extended-Revert-drm-i915-Move-all-ring-resets-before-setting-the-HWS-page.patch

For kernels from 3.16.4 upwards, you first need to REVERT commit ece4a17d237a79f63fbfaf3f724a12b6d500555c BEFORE applying this.
Comment 111 Jiri Kosina 2014-10-21 11:21:11 UTC
I wanted to play with this a little bit more, but even after reverting the band-aid implemented in ece4a17d23, I haven't been able to reproduce the problem so far with 3.18-rc1+ (HEAD == c2661b8060) after 40 suspend-resume cycles (which is a timeframe when the problem usually triggered in the past).

I'll keep trying to reproduce the problem, but it'd be nice if others who have been able to reproduce the problem would be able to try with current Linus' tree (3.18-rc1 and later) with ece4a17d23 workaround reverted, and report their findings.

Thanks.
Comment 112 Manuel Krause 2014-10-21 22:56:04 UTC
Unfortunately 3.18.0-rc1 appears to be highly buggy/unstable on my machine. I'll come back with my re-testing once this is fixed; I definitely will.
Comment 113 Chris Wilson 2014-11-09 19:03:02 UTC
*** Bug 86067 has been marked as a duplicate of this bug. ***
Comment 114 Daniel Vetter 2014-11-21 08:56:56 UTC
(In reply to Manuel Krause from comment #112)
> Unfortunately the 3.18.0-rc1 appears to be highly buggy/ unstable on my
> machine. I'll come back later with my re-testing when this is fixed. But
> I'll definitely do.

We're later in -rc ... any updates on the state of ring init on g4x?
Comment 115 Manuel Krause 2014-11-22 22:31:26 UTC
(In reply to Daniel Vetter from comment #114)
> (In reply to Manuel Krause from comment #112)
> > Unfortunately the 3.18.0-rc1 appears to be highly buggy/ unstable on my
> > machine. I'll come back later with my re-testing when this is fixed. But
> > I'll definitely do.
> 
> We're later in -rc ... any updates on the state of ring init on g4x?

I'm still being hit by another BUG that prevents me from testing this one under real-world conditions. With 3.18-rc5 I get the following already during boot, after which no Xv video playback is possible anymore, while glxgears works:

from dmesg:

[drm:intel_pipe_config_compare] *ERROR* mismatch in pipe_src_w (expected 0, found 4096)
[   36.653126] ------------[ cut here ]------------
[   36.653211] WARNING: CPU: 0 PID: 712 at drivers/gpu/drm/i915/intel_display.c:10966 check_crtc_state+0x7b3/0x1010 [i915]()
[   36.653215] pipe state doesn't match!
[   36.653218] Modules linked in: nf_log_ipv6 xt_pkttype nf_log_ipv4 nf_log_common xt_LOG xt_limit pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) ip6t_REJECT nf_reject_ipv6 xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT nf_reject_ipv4 iptable_raw xt_CT iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables vboxdrv(O) xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables fuse snd_hda_codec_hdmi snd_hda_codec_analog snd_hda_codec_generic coretemp kvm_intel kvm hp_wmi sparse_keymap rfkill iTCO_wdt iTCO_vendor_support snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_seq microcode snd_timer snd_seq_device joydev serio_raw snd lpc_ich mfd_core tg3 libphy ptp pps_core soundcore wmi battery tpm_infineon
[   36.653305]  tpm_tis tpm hp_accel lis3lv02d input_polldev ac evdev acpi_cpufreq sg loop dm_mod ipv6 autofs4 btrfs raid6_pq xor i915 drm_kms_helper drm video i2c_algo_bit button
[   36.653339] CPU: 0 PID: 712 Comm: Xorg Tainted: G           O   3.18.0-rc5-vanilla #1
[   36.653344] Hardware name: Hewlett-Packard HP Compaq 6730b (KU489ET#ABD)/30DD, BIOS 68PDD Ver. F.20 12/07/2011
[   36.653348]  0000000000000009 ffff8800b7c13848 ffffffff814bcd76 0000000000000000
[   36.653356]  ffff8800b7c13898 ffff8800b7c13888 ffffffff8103c347 ffff8800b7c13890
[   36.653362]  ffff880139da4060 ffff880139057000 ffff880139130000 ffff880139057330
[   36.653370] Call Trace:
[   36.653386]  [<ffffffff814bcd76>] dump_stack+0x4e/0x71
[   36.653395]  [<ffffffff8103c347>] warn_slowpath_common+0x77/0xa0
[   36.653402]  [<ffffffff8103c3b1>] warn_slowpath_fmt+0x41/0x50
[   36.653463]  [<ffffffffa012ac91>] ? intel_lvds_get_config+0x41/0xe0 [i915]
[   36.653515]  [<ffffffffa00f5e73>] check_crtc_state+0x7b3/0x1010 [i915]
[   36.653525]  [<ffffffff81063638>] ? dequeue_task_fair+0x368/0x4b0
[   36.653581]  [<ffffffffa010592f>] intel_modeset_check_state+0x27f/0x790 [i915]
[   36.653634]  [<ffffffffa0105ed0>] intel_set_mode+0x20/0x30 [i915]
[   36.653687]  [<ffffffffa0106e7c>] intel_crtc_set_config+0x91c/0xe40 [i915]
[   36.653732]  [<ffffffffa002af01>] drm_mode_set_config_internal+0x61/0xf0 [drm]
[   36.653770]  [<ffffffffa002f424>] drm_mode_setcrtc+0xd4/0x590 [drm]
[   36.653800]  [<ffffffffa00217ac>] drm_ioctl+0x19c/0x630 [drm]
[   36.653814]  [<ffffffff8111ee80>] do_vfs_ioctl+0x2e0/0x4c0
[   36.653822]  [<ffffffff8111f0e1>] SyS_ioctl+0x81/0xa0
[   36.653831]  [<ffffffff814c2bd6>] system_call_fastpath+0x16/0x1b
[   36.653837] ---[ end trace 41cc44d460ddfb2f ]---
[   36.655549] [drm:intel_pipe_config_compare] *ERROR* mismatch in pipe_src_w (expected 0, found 4096)
[   36.655556] ------------[ cut here ]------------
[   36.655618] WARNING: CPU: 0 PID: 712 at drivers/gpu/drm/i915/intel_display.c:10966 check_crtc_state+0x7b3/0x1010 [i915]()
[   36.655622] pipe state doesn't match!
[   36.655625] Modules linked in: nf_log_ipv6 xt_pkttype nf_log_ipv4 nf_log_common xt_LOG xt_limit pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) ip6t_REJECT nf_reject_ipv6 xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT nf_reject_ipv4 iptable_raw xt_CT iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables vboxdrv(O) xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables fuse snd_hda_codec_hdmi snd_hda_codec_analog snd_hda_codec_generic coretemp kvm_intel kvm hp_wmi sparse_keymap rfkill iTCO_wdt iTCO_vendor_support snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_seq microcode snd_timer snd_seq_device joydev serio_raw snd lpc_ich mfd_core tg3 libphy ptp pps_core soundcore wmi battery tpm_infineon
[   36.655709]  tpm_tis tpm hp_accel lis3lv02d input_polldev ac evdev acpi_cpufreq sg loop dm_mod ipv6 autofs4 btrfs raid6_pq xor i915 drm_kms_helper drm video i2c_algo_bit button
[   36.655740] CPU: 0 PID: 712 Comm: Xorg Tainted: G        W  O   3.18.0-rc5-vanilla #1
[   36.655745] Hardware name: Hewlett-Packard HP Compaq 6730b (KU489ET#ABD)/30DD, BIOS 68PDD Ver. F.20 12/07/2011
[   36.655749]  0000000000000009 ffff8800b7c13848 ffffffff814bcd76 0000000000000000
[   36.655756]  ffff8800b7c13898 ffff8800b7c13888 ffffffff8103c347 ffff8800b7c13890
[   36.655763]  ffff880139da4060 ffff880139057000 ffff880139130000 ffff880139057330
[   36.655770] Call Trace:
[   36.655782]  [<ffffffff814bcd76>] dump_stack+0x4e/0x71
[   36.655790]  [<ffffffff8103c347>] warn_slowpath_common+0x77/0xa0
[   36.655797]  [<ffffffff8103c3b1>] warn_slowpath_fmt+0x41/0x50
[   36.655860]  [<ffffffffa012ac91>] ? intel_lvds_get_config+0x41/0xe0 [i915]
[   36.655913]  [<ffffffffa00f5e73>] check_crtc_state+0x7b3/0x1010 [i915]
[   36.655971]  [<ffffffffa010592f>] intel_modeset_check_state+0x27f/0x790 [i915]
[   36.656010]  [<ffffffffa0105ed0>] intel_set_mode+0x20/0x30 [i915]
[   36.656114]  [<ffffffffa0106cc1>] intel_crtc_set_config+0x761/0xe40 [i915]
[   36.656156]  [<ffffffffa002af01>] drm_mode_set_config_internal+0x61/0xf0 [drm]
[   36.656194]  [<ffffffffa002f424>] drm_mode_setcrtc+0xd4/0x590 [drm]
[   36.656224]  [<ffffffffa00217ac>] drm_ioctl+0x19c/0x630 [drm]
[   36.656239]  [<ffffffff8111ee80>] do_vfs_ioctl+0x2e0/0x4c0
[   36.656247]  [<ffffffff8111f0e1>] SyS_ioctl+0x81/0xa0
[   36.656256]  [<ffffffff814c2bd6>] system_call_fastpath+0x16/0x1b
[   36.656261] ---[ end trace 41cc44d460ddfb30 ]---
[   36.691044] [drm:i9xx_set_fifo_underrun_reporting] *ERROR* pipe A underrun
[   37.153072] [drm:i9xx_set_fifo_underrun_reporting] *ERROR* pipe A underrun
[   37.153084] [drm:i965_irq_handler] *ERROR* pipe A underrun

from Xorg.0.log:
[    36.043] (II) AIGLX: Loaded and initialized i965
[    36.043] (II) GLX: Initialized DRI2 GL provider for screen 0
[    36.067] (WW) intel(0): Failed to submit rendering commands, trying again with outputs disabled.
[    36.655] (EE) intel(0): unable to attach scanout
[    36.655] (EE) intel(0): Failed to submit rendering commands, disabling acceleration.


This may have nothing to do with this bug, but I'd be glad if one of you could point me to an existing related bug report (Google didn't help) or even a solution.

I'm now normally using a 3.17.4 kernel (without my revert patch), and I still hit this bug regularly (approx. two out of three reboots; once hibernation has worked correctly, it keeps working), but it remains unpredictable. For a long time I thought it also depended on the userspace Xorg/intel parts, but I'm not sure about that anymore, as I couldn't confirm it over several months of regularly updating those parts of my openSUSE installation.

Best regards, I hope my info helps a bit,
Manuel Krause
Comment 116 Chris Wilson 2014-11-24 09:11:54 UTC
(In reply to Manuel Krause from comment #115)
> from Xorg.0.log:
> [    36.043] (II) AIGLX: Loaded and initialized i965
> [    36.043] (II) GLX: Initialized DRI2 GL provider for screen 0
> [    36.067] (WW) intel(0): Failed to submit rendering commands, trying
> again with outputs disabled.
> [    36.655] (EE) intel(0): unable to attach scanout
> [    36.655] (EE) intel(0): Failed to submit rendering commands, disabling
> acceleration.

That's PIN_BIAS fallout.
Comment 117 Manuel Krause 2014-11-25 18:32:14 UTC
(In reply to Chris Wilson from comment #116)
> (In reply to Manuel Krause from comment #115)
> > from Xorg.0.log:
> > [    36.043] (II) AIGLX: Loaded and initialized i965
> > [    36.043] (II) GLX: Initialized DRI2 GL provider for screen 0
> > [    36.067] (WW) intel(0): Failed to submit rendering commands, trying
> > again with outputs disabled.
> > [    36.655] (EE) intel(0): unable to attach scanout
> > [    36.655] (EE) intel(0): Failed to submit rendering commands, disabling
> > acceleration.
> 
> That's PIN_BIAS fallout.

I don't know what to do with this information. :-( 
Kernel 3.18-rc6 has the same issue -- is there anything I can do about it?
Comment 118 Daniel Vetter 2014-11-26 08:29:05 UTC
(In reply to Manuel Krause from comment #117)
> (In reply to Chris Wilson from comment #116)
> > (In reply to Manuel Krause from comment #115)
> > > from Xorg.0.log:
> > > [    36.043] (II) AIGLX: Loaded and initialized i965
> > > [    36.043] (II) GLX: Initialized DRI2 GL provider for screen 0
> > > [    36.067] (WW) intel(0): Failed to submit rendering commands, trying
> > > again with outputs disabled.
> > > [    36.655] (EE) intel(0): unable to attach scanout
> > > [    36.655] (EE) intel(0): Failed to submit rendering commands, disabling
> > > acceleration.
> > 
> > That's PIN_BIAS fallout.
> 
> I don't know what to do with this information. :-( 
> Kernel 3.18-rc6 has the same issue -- is there something I can do against it
> ?

http://www.spinics.net/lists/stable/msg71063.html should address this one - it's trickling through the queues (probably with a detour through 3.19).
Comment 119 Manuel Krause 2014-11-26 19:15:59 UTC
(In reply to Daniel Vetter from comment #118)
> (In reply to Manuel Krause from comment #117)
> > (In reply to Chris Wilson from comment #116)
> > > (In reply to Manuel Krause from comment #115)
> > > > from Xorg.0.log:
> > > > [    36.043] (II) AIGLX: Loaded and initialized i965
> > > > [    36.043] (II) GLX: Initialized DRI2 GL provider for screen 0
> > > > [    36.067] (WW) intel(0): Failed to submit rendering commands, trying
> > > > again with outputs disabled.
> > > > [    36.655] (EE) intel(0): unable to attach scanout
> > > > [    36.655] (EE) intel(0): Failed to submit rendering commands, disabling
> > > > acceleration.
> > > 
> > > That's PIN_BIAS fallout.
> > 
> > I don't know what to do with this information. :-( 
> > Kernel 3.18-rc6 has the same issue -- is there something I can do against it
> > ?
> 
> http://www.spinics.net/lists/stable/msg71063.html should address this one -
> it's trickling through the queues (probably with a detour through 3.19).

Thank you very, very much!! This patch fixes the above-mentioned issue.

So, if I understand correctly, the current testing for this bug is suspending/resuming on 3.18-rc with the ece4a17d23 workaround reverted, to see whether the symptoms recur at all?
How many iterations are considered sensible or sufficient?
Comment 120 Manuel Krause 2014-12-01 23:05:08 UTC
I hope that the patch cited in Comment 118 gets included as soon as possible, as it enabled my testing (and, beyond that, would let me use 3.18?).

With 3.18.0-rc6 WITH the workaround I did:
5 hibernate+resume cycles / reboot into the same kernel /
5 hibernate+resume cycles / reboot into the same kernel /
1 hibernate+resume cycle.
No issues.

With 3.18.0-rc6 WITHOUT the workaround I did:
5 hibernate+resume cycles / reboot into my everyday 3.17.4 kernel / 1 hibernate+resume cycle / reboot into 3.18.0-rc6 WITHOUT the workaround again,
and repeated this whole sequence 4 times.
No issues.

I know these tests don't reach Jiri's count of 40 suspend/resume cycles, but mine were done under real-world conditions (loading applications, playing video, etc.).
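
For anyone repeating such test runs, here is a minimal sketch of how a saved dmesg capture from each cycle could be checked automatically instead of by eye. It is only an illustration: the function name scan_dmesg and the signature labels are made up for this example, and the patterns are simply taken from the error lines quoted in this report (the bug summary and the dmesg excerpt in comment 115), so they may need adjusting for other kernel versions.

```python
import re
from collections import Counter

# Hypothetical labels; the patterns are copied from messages quoted in this report.
SIGNATURES = {
    "ring_init_failed": re.compile(r"\*ERROR\* render ring initialization failed"),
    "pipe_mismatch": re.compile(r"\*ERROR\* mismatch in \w+ \(expected \d+, found \d+\)"),
    "fifo_underrun": re.compile(r"\*ERROR\* pipe [A-C] underrun"),
}


def scan_dmesg(text):
    """Count how many lines of a dmesg capture match each signature."""
    counts = Counter()
    for line in text.splitlines():
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                counts[name] += 1
    return counts
```

Saving the output of dmesg to a file after every resume and passing its contents to scan_dmesg() gives a per-cycle count; a non-zero "ring_init_failed" count would mark that cycle as a reproduction of this bug.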

Can one of you explain what makes 3.18 so different from 3.17 in this case? It would be nice to see the related improvements backported to 3.17.

Thank you in advance, Manuel
Comment 121 Jani Nikula 2015-01-29 10:09:11 UTC
(In reply to Manuel Krause from comment #120)
> I hope, that the patch cited in Comment 118 will get included as soon as
> possible, as this enabled my testing (or even more, would enable me using
> 3.18?).

That's in v3.18.4.
Comment 122 Chris Wilson 2015-02-09 13:01:01 UTC
The ring init appears to have been fixed in v3.18; at least there have been no further reports since we merged the w/a patch.

