This is not reliably reproducible, so a proper bisect is not possible. Some suspend-resume cycles end with Xorg misbehaving (graphics not redrawing properly, etc.), and dmesg contains:
[drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head ffffff8804 tail 00000000 start 000e4000
This happens with current Linus' tree (HEAD 774868c70), and the issue is still present after merging drm-intel-testing (HEAD 14347c2) into it. There is an ongoing mailing list discussion here: https://lkml.org/lkml/2014/2/27/183
OpenGL should be dead after resume, but the DDX should still behave -- everything should be accessible following a failed resume. Can you please attach your Xorg.0.log after such a failed resume?
Created attachment 96294 [details] [review] Move all ring resets before setting the HWS patch Out of curiosity, can you try?
(In reply to comment #2)
> Created attachment 96294 [details] [review]
> Move all ring resets before setting the HWS patch
>
> Out of curiosity, can you try?
This actually seems to make things substantially worse: in both of the two suspend-resume cycles I ran with a kernel that had this patch applied (on top of drm-intel-testing), the issue triggered.
Created attachment 96298 [details] Xorg.0.log from the broken resume (with Chris' patch applied on top of drm-intel-testing).
Hmm, a piece of UXA state became corrupt (likely an invalid fb object or something). How does SNA fare? In particular, we can then run the DDX with --enable-debug=full to see what goes wrong. Or we might be able to spot it from a drm.debug=7 dmesg.
As for the kernel patch, that's weird... Presumably it is then the order in which the ring registers are written.
Created attachment 96304 [details] [review] Explicitly stop the rings before resetting One last idea to try on top of the previous patch is to wait for ring-idle first.
Created attachment 96305 [details] dmesg with drm.debug=7 Attached is a dmesg with drm.debug=7 from the resume that had the problem (I had to gzip it due to its size). I had to increase the ringbuffer size due to the flood of
WARNING: CPU: 1 PID: 111 at drivers/gpu/drm/drm_modes.c:119 drm_mode_probed_add+0x51/0x60 [drm]()
which are new since I merged drm-intel-testing. Those are not there with Linus' tree, but the ringbuffer issue still happens. I will report the WARNs separately later.
The dmesg from comment#8 is from a kernel that didn't yet have the patch from comment#7 applied. I will be testing that ASAP, thanks.
[ 45.141243] [drm:drm_ioctl], pid=1519, dev=0xe200, auth=1, I915_GEM_EXECBUFFER2
[ 45.141256] [drm:i915_gem_do_execbuffer], execbuf with invalid ring: 0
[ 45.141260] [drm:drm_ioctl], ret = -22
Wow.
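For readers decoding the trace: the return codes in these logs are negated errno values. A minimal sketch, assuming Linux errno numbering, where 22 is EINVAL and 5 is EIO (the code reported for the failed resume elsewhere in this report). The helper name `describe_kernel_ret` is hypothetical and not part of any driver.

```c
#include <errno.h>

/* Hypothetical helper: map a negative kernel-style return code, as printed
 * in the drm_ioctl trace, to its errno name. Only the two codes that show
 * up in this report are handled; anything else is reported as "other". */
static const char *describe_kernel_ret(int ret)
{
	switch (-ret) {
	case EINVAL:	/* 22 on Linux: "execbuf with invalid ring" */
		return "EINVAL";
	case EIO:	/* 5 on Linux: the resume failure code */
		return "EIO";
	default:
		return "other";
	}
}
```

Under that assumption, `ret = -22` means the execbuffer ioctl rejected its arguments rather than reporting a hardware error.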
Created attachment 96307 [details] [review] Mark device as wedged if we fail to resume This should help UXA to render correctly following the resume failure.
Created attachment 96311 [details] dmesg-2 Unfortunately the patch from comment #11 didn't help either. Attaching dmesg of the failure with the patch applied.
Hmm, UXA is being aggressively dumb. It even gets told the GPU is wedged, but ignores it. The patch did the right thing, but UXA is still not able to notice since it doesn't check for errors when it should.
Created attachment 96323 [details] [review] Report EIO after resume failure in execbuffer
Created attachment 96406 [details] [review] Preserve ring buffers across resume Another patch to apply on top of the first 3.
(In reply to comment #15) > Created attachment 96406 [details] [review] [review] > Preserve ring buffers across resume > > Another patch to apply on top of the first 3. What tree is this patch against please? I am getting rejects in drivers/gpu/drm/i915/intel_ringbuffer.c both in Linus' tree and in drm-intel-next branch of drm-intel tree.
I've rebased the patches against drm-intel-nightly so they should apply to most recent kernel trees: http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug76554
Built a kernel pulled from git://people.freedesktop.org/~ickle/linux-2.6 bug76554 with the topmost commit being
commit 1318add417cf6c9dba373393e5b7be62e3283c84
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Mon Mar 24 19:17:11 2014 +0000
drm/i915: Allow the module to load even if we fail to setup rings
but unfortunately the symptoms on resume from hibernation are still exactly the same.
For reference, can you please attach the drm.debug=7 from the branch across resume? At the least it should have prevented UXA from freezing. :|
Created attachment 96776 [details] drm.debug=7 dmesg with patched kernel Attached is a drm.debug=7 dmesg demonstrating the problem with everything from git://people.freedesktop.org/~ickle/linux-2.6 bug76554 (HEAD == 1318add417c) applied.
Ah, oops, I missed a patch from that branch to prevent the execbuffer from quietly succeeding. That explains why UXA kept on failing, but not why the rings still will not restart.
One more random rearrangement that should apply on top of that branch:
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 602432eaf346..bbcd6b5446f3 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -456,9 +456,9 @@ static bool stop_ring(struct intel_ring_buffer *ring)
 		}
 	}
 
+	I915_WRITE_CTL(ring, 0);
 	I915_WRITE_HEAD(ring, 0);
 	ring->write_tail(ring, 0);
-	I915_WRITE_CTL(ring, 0);
 
 	if (!IS_GEN2(ring->dev)) {
 		(void)I915_READ_CTL(ring);
@@ -513,18 +513,19 @@ static int init_ring_common(struct intel_ring_buffer *ring)
 	I915_WRITE_CTL(ring,
 			((ring->size - PAGE_SIZE) & RING_NR_PAGES)
 			| RING_VALID);
+	I915_WRITE_HEAD(ring, 0);
+	ring->write_tail(ring, 0);
 
 	/* If the head is still not zero, the ring is dead */
 	if (wait_for((I915_READ_CTL(ring) & RING_VALID) != 0 &&
 		     I915_READ_START(ring) == i915_gem_obj_ggtt_offset(obj) &&
 		     (I915_READ_HEAD(ring) & HEAD_ADDR) == 0, 50)) {
 		DRM_ERROR("%s initialization failed "
-				"ctl %08x head %08x tail %08x start %08x\n",
-				ring->name,
-				I915_READ_CTL(ring),
-				I915_READ_HEAD(ring),
-				I915_READ_TAIL(ring),
-				I915_READ_START(ring));
+			  "ctl %08x (valid? %d) head %08x tail %08x start %08x [expected %08x]\n",
+			  ring->name,
+			  I915_READ_CTL(ring), I915_READ_CTL(ring) & RING_VALID,
+			  I915_READ_HEAD(ring), I915_READ_TAIL(ring),
+			  I915_READ_START(ring), i915_gem_obj_ggtt_offset(obj));
 		ret = -EIO;
 		goto out;
 	}
You may also want to cherry-pick ec9da60002b2390a3932db36d61d1d4e30c4ee21 from the bug76554 branch to prevent UXA from freezing.
I refetched the git branch to manually apply the reordering patch on top of it (bugzilla is damaging it, could you please attach it next time? thanks), but the branch doesn't build any more:
drivers/gpu/drm/i915/intel_ringbuffer.c: In function ‘stop_ring’:
drivers/gpu/drm/i915/intel_ringbuffer.c:444: error: ‘drm_i915_private_t’ undeclared (first use in this function)
drivers/gpu/drm/i915/intel_ringbuffer.c:444: error: (Each undeclared identifier is reported only once
drivers/gpu/drm/i915/intel_ringbuffer.c:444: error: for each function it appears in.)
drivers/gpu/drm/i915/intel_ringbuffer.c:444: error: ‘dev_priv’ undeclared (first use in this function)
make[2]: *** [drivers/gpu/drm/i915/intel_ringbuffer.o] Error 1
make[2]: *** Waiting for unfinished jobs....
drivers/gpu/drm/i915/i915_gem.c: In function ‘i915_gem_stop_ringbuffers’:
drivers/gpu/drm/i915/i915_gem.c:4240: error: ‘drm_i915_private_t’ undeclared (first use in this function)
drivers/gpu/drm/i915/i915_gem.c:4240: error: (Each undeclared identifier is reported only once
drivers/gpu/drm/i915/i915_gem.c:4240: error: for each function it appears in.)
drivers/gpu/drm/i915/i915_gem.c:4240: error: ‘dev_priv’ undeclared (first use in this function)
drivers/gpu/drm/i915/i915_gem.c:4241: warning: ISO C90 forbids mixed declarations and code
drivers/gpu/drm/i915/i915_gem.c:4244: warning: left-hand operand of comma expression has no effect
make[2]: *** [drivers/gpu/drm/i915/i915_gem.o] Error 1
make[1]: *** [drivers/gpu/drm/i915] Error 2
make: *** [drivers/gpu/drm/] Error 2
Topmost commit of the branch is
commit ec9da60002b2390a3932db36d61d1d4e30c4ee21
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Mon Mar 24 17:56:36 2014 +0000
Bleh, rebase error. All suggested patches are now up on #bug76554.
#bug76554 head is currently
commit cfa8aaa35f180268c99e72964228c944930af680
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Apr 2 13:37:24 2014 +0100
Created attachment 96783 [details] drm.debug=7 dmesg with patched kernel (cfa8aaa3) With the branch that has cfa8aaa3 as its topmost commit, the ring initialization failures still pop up on resume, but the Xorg rendering turning into a complete mess is finally solved: the Xorg session is not corrupted and works! (Although it feels like the whole thing is slower, that might be due to the excessive logging going on.) dmesg with drm.debug=7 attached. So if you are going to push any of this upstream, please feel free to add my
Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz>
to it, although I assume the ring initialization failure still needs to be solved ... ? Thanks!
(In reply to comment #26)
> Created attachment 96783 [details]
> drm.debug=7 dmesg with patched kernel (cfa8aaa3)
>
> With the branch that has cfa8aaa3 as a topmost commit, the ring
> initialization failures are still popping up on resume, but Xorg rendering
> turning into complete mess is finally solved, and the Xorg session is not
> corrupted and works! (although it feels like the whole thing is slower,
> but that might be due to excessive logging going on).
Indeed. What happens is that UXA now finally detects that the kernel is reporting that it cannot execute GPU commands, and instead it falls back to CPU rendering directly into the framebuffer.
> dmesg with drm.debug=7 attached.
>
> So if you are going to push anything of this upstream, please feel free to
> add my
>
> Reported-and-tested-by: Jiri Kosina <jkosina@suse.cz>
>
> to it, although I assume the ring initialization failure still needs to be
> solved ... ?
Yes. We never knew why g45 failed in the first place; if we can figure out what changed now, we may be able to create a better band-aid.
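The fallback described in this comment can be pictured with a small userspace sketch. This is not the UXA code: `struct renderer`, `draw()` and the submit callbacks are hypothetical names invented for illustration, and the only fact taken from this report is that the kernel reports the wedged GPU through an EIO error on submission.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical renderer state; not the real UXA data structures. */
struct renderer {
	bool wedged;	/* set once a submission has failed with -EIO */
	int gpu_draws;	/* operations accepted by the GPU path */
	int cpu_draws;	/* operations rendered in software */
};

/* Hypothetical callback standing in for the execbuffer ioctl:
 * returns 0 on success or a negative errno such as -EIO. */
typedef int (*submit_fn)(void);

/* Try the GPU first; after the first -EIO, stay on the CPU path. */
static void draw(struct renderer *r, submit_fn submit)
{
	if (!r->wedged) {
		int ret = submit();

		if (ret == 0) {
			r->gpu_draws++;
			return;
		}
		if (ret == -EIO)
			r->wedged = true;
	}
	r->cpu_draws++;	/* software fallback into the framebuffer */
}

/* Example callbacks: a healthy GPU and a wedged one. */
static int submit_ok(void) { return 0; }
static int submit_wedged(void) { return -EIO; }
```

The point of latching `wedged` is that a driver which ignores the error keeps submitting work that never executes, which matches the corrupted rendering seen before the branch was fixed.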
(In reply to comment #27)
> Indeed. What happens is that UXA now finally detects that the kernel is
> reporting that it cannot execute GPU commands, and instead it falls back to
> CPU rendering directly into the framebuffer.
Understood, thanks. So the kernel should probably put a huge warning into dmesg once such a condition is detected and the workaround is applied.
> > to it, although I assume the ring initialization failure still needs to be
> > solved ... ?
>
> Yes. We never knew why g45 failed in the first place, if we can figure out
> what changed now, we may be able to create a better band-aid.
Excellent, thanks. Happy to test any diagnostic patches necessary.
BTW, may I kindly ask what your plans for those patches are? Although it's clear that root-causing the ring initialization failures is still the priority, without this kind of band-aid present in Linus' tree, that tree is almost completely useless on my system. Thanks.
The temporary fix is on its way upstream (under review atm), as keeping the system limping along is essential.
Now that the proper fallback handling is on track, have we attempted to bisect where the underlying root cause (ring init failure on resume) was made much worse? I guess on some older kernels this worked better. No guarantee that it'll help since this gm45 ring init issue is really elusive, but it might shed some light on what's going on.
(In reply to comment #31)
> Now that the proper fallback handling is on track, have we attempted to
> bisect where the underlying root-cause (ring init failure on resume) was
> made much worse? I guess on some older kernels this worked better.
I am afraid this is close to impossible. The frequency of the problem fluctuates *a lot* between different kernels.
- I am pretty sure that I have *never ever* seen it happen on a 3.7 kernel, and that kernel has been exercised a lot on the system in question
- Around 3.13, this seems to happen from time to time (say once in 40 resumes, but with a rather large standard deviation)
- With current Linus' tree and with the drm tree as well, this happens super-reliably on almost every resume from hibernation
I don't have enough data from the kernels in between to be able to claim the ratio reliably. I am afraid this pretty much implies that bisecting this would consume an incredible amount of time and might still produce an unreliable result.
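To put a rough number on why bisecting an intermittent failure is so expensive, here is a back-of-the-envelope sketch. It assumes each resume fails independently with a fixed probability, which is an idealisation given the large variance mentioned above; `clean_resumes_needed` is a hypothetical helper written for this illustration.

```c
/* Smallest number n of consecutive clean resumes such that a kernel which
 * really fails with probability p per resume would have shown at least one
 * failure with probability >= confidence, i.e. (1 - p)^n <= 1 - confidence.
 * Returns -1 for inputs outside the open interval (0, 1). */
static int clean_resumes_needed(double p, double confidence)
{
	double miss = 1.0;	/* chance that n resumes all look clean */
	int n = 0;

	if (p <= 0.0 || p >= 1.0 || confidence <= 0.0 || confidence >= 1.0)
		return -1;

	while (miss > 1.0 - confidence) {
		miss *= 1.0 - p;
		n++;
	}
	return n;
}
```

With the "once in 40 resumes" estimate and a 95% confidence target, `clean_resumes_needed(1.0 / 40.0, 0.95)` returns 119, and that many clean resumes would be needed at every bisection step before marking a commit good.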
Created attachment 97027 [details] [review] Print ring registers for debugging I think this might help in working out what the values in the registers mean. I think it is sticking to the old value, but I am not sure, hence the patch.
Created attachment 97034 [details] dmesg with ring contents dump before/after initialization This is a dmesg from a resume where ring initialization fails with all the patches posted here so far (including the before/after ring contents dump) applied.
That's scary. The immediate read of RING_HEAD after it returned 0 during the first initialisation returns a non-zero value... It only just barely passed the self-checks during module load. Just as importantly, it did not have the pattern I was expecting. I think we should try emitting a dummy command and seeing if the CS ring updates.
Created attachment 97035 [details] [review] Poke the ring to see if it is awake Maybe this is enough to see if the ring responds correctly. Please keep the ring debug patch in place.
Created attachment 97036 [details] dmesg with ring contents dump and MI_NOOP writes issued Unfortunately the error is still there even with the MI_NOOP writes. dmesg with that (and all the previous patches) applied is attached.
So, is there anything else I should try, please, given that bisecting is not really a viable option here? It's a rather annoying bug, and I intend to help as much as possible to get it sorted out.
Hmm. I missed that the "after initialisation" printk is correct. So perhaps all we need is to wait a little longer...
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 2eb85cc2062f..5a74986348c6 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -537,7 +537,7 @@ static int init_ring_common(struct intel_ring_buffer *ring)
 	/* If the head is still not zero, the ring is dead */
 	if (wait_for((I915_READ_CTL(ring) & RING_VALID) != 0 &&
 		     I915_READ_START(ring) == i915_gem_obj_ggtt_offset(obj) &&
-		     (I915_READ_HEAD(ring) & HEAD_ADDR) == 8, 50)) {
+		     (I915_READ_HEAD(ring) & HEAD_ADDR) == 8, 1000)) {
 		DRM_ERROR("%s initialization failed "
 				"ctl %08x (valid? %d) head %08x tail %08x start %08x [expected %08lx]\n",
 				ring->name,
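As a rough illustration of what lengthening the wait does, here is a userspace sketch of a bounded poll. It is not the kernel's wait_for() macro: the budget is counted in polls rather than milliseconds, and `poll_until`, `cond_fn` and `ready_after` are hypothetical names. The only behaviour borrowed from the patch is that a nonzero result is treated as failure.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for a register-backed condition. */
typedef bool (*cond_fn)(void *ctx);

/* Poll cond up to max_polls times. Returns 0 as soon as it holds and
 * -ETIMEDOUT if the budget runs out first. */
static int poll_until(cond_fn cond, void *ctx, int max_polls)
{
	for (int i = 0; i < max_polls; i++) {
		if (cond(ctx))
			return 0;
	}
	return -ETIMEDOUT;
}

/* Example condition: becomes true once the counter in ctx reaches zero. */
static bool ready_after(void *ctx)
{
	int *remaining = ctx;

	return --(*remaining) <= 0;
}
```

A larger budget only changes the outcome if the condition eventually becomes true; a register that settles on the wrong value times out however long the wait is.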
(In reply to comment #39)
> Hmm. I missed that the "after initialisation" printk is correct. So perhaps
> all we need is to wait a little longer...
Unfortunately the symptoms are still the same even with timeout == 1000:
[ 54.108192] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 (valid? 1) head 000e299c tail 00000008 start 000e4000 [expected 000e4000]
[ 54.108201] Ring render ring after initialisation: 0001f001 000e299c 00000008 000e4000
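The ctl value in this log can be read against the expression that init_ring_common() writes in the patches quoted in this report, ((ring->size - PAGE_SIZE) & RING_NR_PAGES) | RING_VALID. A minimal sketch, assuming RING_VALID is bit 0, RING_NR_PAGES is 0x001FF000 and pages are 4 KiB; none of these values is stated in this report, and the helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed register layout; not quoted anywhere in this report. */
#define RING_VALID	0x00000001u
#define RING_NR_PAGES	0x001FF000u
#define PAGE_SIZE_BYTES	4096u

/* True if the ring-enable bit is set in a CTL readback. */
static bool ctl_valid(uint32_t ctl)
{
	return (ctl & RING_VALID) != 0;
}

/* init_ring_common() programs (size - PAGE_SIZE) into the page field,
 * so add one page back to recover the ring size in bytes. */
static uint32_t ctl_ring_bytes(uint32_t ctl)
{
	return (ctl & RING_NR_PAGES) + PAGE_SIZE_BYTES;
}
```

Under these assumptions, ctl 0001f001 decodes to an enabled ring of 0x20000 bytes (128 KiB), consistent with the "(valid? 1)" field in the log, so CTL and START read back as programmed and only HEAD is unexpected.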
One last paste... (Apologies for any white space issues, this is just trying to be quick and dirty.)
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 5a74986348c6..75365c1588fb 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -526,9 +526,28 @@ static int init_ring_common(struct intel_ring_buffer *ring)
 	 * also enforces ordering), otherwise the hw might lose the new ring
 	 * register values. */
 	I915_WRITE_START(ring, i915_gem_obj_ggtt_offset(obj));
+	if (wait_for(I915_READ_START(ring) == i915_gem_obj_ggtt_offset(obj),
+		     1000)) {
+		DRM_ERROR("%s initialization failed "
+			  "start %08x [expected %08lx]\n",
+			  ring->name,
+			  I915_READ_START(ring),
+			  (unsigned long)i915_gem_obj_ggtt_offset(obj));
+		ret = -EIO;
+		goto out;
+	}
+
 	I915_WRITE_CTL(ring,
 			((ring->size - PAGE_SIZE) & RING_NR_PAGES)
 			| RING_VALID);
+	if (wait_for(I915_READ_CTL(ring) & RING_VALID, 1000)) {
+		DRM_ERROR("%s initialization failed ctl %08x (valid? %d)\n",
+			  ring->name,
+			  I915_READ_CTL(ring),
+			  !!(I915_READ_CTL(ring) & RING_VALID));
+		ret = -EIO;
+		goto out;
+	}
Created attachment 97739 [details] dmesg with all the patches up to now applied Attaching dmesg with all patches (up to and including the one in comment #41) included with the error condition triggering.
If it keeps resetting HEAD to a random value after switching the ring on, how does it ever work? :| Another hack:
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 5a74986348c6..e47324aa8963 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -530,6 +530,13 @@ static int init_ring_common(struct intel_ring_buffer *ring)
 			((ring->size - PAGE_SIZE) & RING_NR_PAGES)
 			| RING_VALID);
 
+	if (I915_READ_START(ring) != i915_gem_obj_ggtt_offset(obj)) {
+		printk(KERN_ERR "%s initialization failed [%08x != %08x], fudging\n",
+		       ring->name, I915_READ_START(ring), i915_gem_obj_ggtt_offset(obj));
+		I915_WRITE_START(ring, i915_gem_obj_ggtt_offset(obj));
+		POSTING_READ(ring);
+	}
+
 	iowrite32(MI_NOOP, ring->virtual_start + 0);
 	iowrite32(MI_NOOP, ring->virtual_start + 4);
 	ring->write_tail(ring, 8);
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 5a74986348c6..b46b3e928a7f 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -530,6 +530,17 @@ static int init_ring_common(struct intel_ring_buffer *ring)
 			((ring->size - PAGE_SIZE) & RING_NR_PAGES)
 			| RING_VALID);
 
+	if (I915_READ_START(ring) != i915_gem_obj_ggtt_offset(obj)) {
+		printk(KERN_ERR
+		       "%s initialization failed"
+		       " [%08x != %08x], fudging\n",
+		       ring->name,
+		       I915_READ_START(ring),
+		       i915_gem_obj_ggtt_offset(obj));
+		I915_WRITE_START(ring, i915_gem_obj_ggtt_offset(obj));
+		POSTING_READ(ring);
+	}
+
 	iowrite32(MI_NOOP, ring->virtual_start + 0);
 	iowrite32(MI_NOOP, ring->virtual_start + 4);
 	ring->write_tail(ring, 8);
(In reply to comment #44)
> diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c
> b/drivers/gpu/drm/i915/intel_ringbuffer.c
> index 5a74986348c6..b46b3e928a7f 100644
> --- a/drivers/gpu/drm/i915/intel_ringbuffer.c
> +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
> @@ -530,6 +530,17 @@ static int init_ring_common(struct intel_ring_buffer
> *ring)
> ((ring->size - PAGE_SIZE) & RING_NR_PAGES)
> | RING_VALID);
>
> + if (I915_READ_START(ring) != i915_gem_obj_ggtt_offset(obj)) {
> + printk(KERN_ERR
> + "%s initialization failed"
> + " [%08x != %08x], fudging\n",
> + ring->name,
> + I915_READ_START(ring),
> + i915_gem_obj_ggtt_offset(obj));
> + I915_WRITE_START(ring, i915_gem_obj_ggtt_offset(obj));
> + POSTING_READ(ring);
> + }
> +
> iowrite32(MI_NOOP, ring->virtual_start + 0);
> iowrite32(MI_NOOP, ring->virtual_start + 4);
> ring->write_tail(ring, 8);
What is the baseline I should apply this on top of, please? The surrounding code in my tree (with all the patches provided so far applied) is
[ ... ]
	I915_WRITE_CTL(ring,
			((ring->size - PAGE_SIZE) & RING_NR_PAGES)
			| RING_VALID);
	if (wait_for(I915_READ_CTL(ring) & RING_VALID, 1000)) {
		DRM_ERROR("%s initialization failed ctl %08x (valid? %d)\n",
			  ring->name,
			  I915_READ_CTL(ring),
			  !!(I915_READ_CTL(ring) & RING_VALID));
		ret = -EIO;
		goto out;
	}
	I915_WRITE_HEAD(ring, 0);
	ring->write_tail(ring, 0);

	iowrite32(MI_NOOP, ring->virtual_start + 0);
	iowrite32(MI_NOOP, ring->virtual_start + 4);
	ring->write_tail(ring, 8);
[ ... ]
(i.e. it has the extra I915_WRITE_HEAD(ring, 0); ring->write_tail(ring, 0); etc.). I can of course easily apply the hunk between ring->write_tail(ring, 0); and iowrite32(MI_NOOP, ring->virtual_start + 0); if that's what you want me to do. Thanks.
(In reply to comment #45) > (In reply to comment #44) > > diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c > > b/drivers/gpu/drm/i915/intel_ringbuffer.c > > index 5a74986348c6..b46b3e928a7f 100644 > > --- a/drivers/gpu/drm/i915/intel_ringbuffer.c > > +++ b/drivers/gpu/drm/i915/intel_ringbuffer.c > > @@ -530,6 +530,17 @@ static int init_ring_common(struct intel_ring_buffer > > *ring) > > ((ring->size - PAGE_SIZE) & RING_NR_PAGES) > > | RING_VALID); > > > > + if (I915_READ_START(ring) != i915_gem_obj_ggtt_offset(obj)) { > > + printk(KERN_ERR > > + "%s initialization failed" > > + " [%08x != %08x], fudging\n", > > + ring->name, > > + I915_READ_START(ring), > > + i915_gem_obj_ggtt_offset(obj)); > > + I915_WRITE_START(ring, i915_gem_obj_ggtt_offset(obj)); > > + POSTING_READ(ring); > > + } > > + > > iowrite32(MI_NOOP, ring->virtual_start + 0); > > iowrite32(MI_NOOP, ring->virtual_start + 4); > > ring->write_tail(ring, 8); > > What is a baseline I should apply this on top of, please? The surrounding > code in my tree (with all the patches provided so far apples) is > > [ ... ] > I915_WRITE_CTL(ring, > ((ring->size - PAGE_SIZE) & RING_NR_PAGES) > | RING_VALID); > if (wait_for(I915_READ_CTL(ring) & RING_VALID, 1000)) { > DRM_ERROR("%s initialization failed ctl %08x (valid? %d)\n", > ring->name, > I915_READ_CTL(ring), > !!(I915_READ_CTL(ring) & RING_VALID)); > ret = -EIO; > goto out; > } > I915_WRITE_HEAD(ring, 0); > ring->write_tail(ring, 0); > > iowrite32(MI_NOOP, ring->virtual_start + 0); > iowrite32(MI_NOOP, ring->virtual_start + 4); > ring->write_tail(ring, 8); > [ ... ] > > (i.e. it has the extra I915_WRITE_HEAD(ring, 0); ring->write_tail(ring, 0);, > etc). > > I can of course easily apply the hunk just between the > > ring->write_tail(ring, 0); > > and > > iowrite32(MI_NOOP, ring->virtual_start + 0); > > if that's what you want me to do. > > Thanks. Sorry, I threw away the preceding hack to try and keep the diff clean. 
Just plonk the write to set HEAD again after setting CTRL (and the wait_for(CTRL) if you have that). Hmm, it appears we have drifted slightly in our assortment of patches, let me push my current collection of hacks so we can rebase.
Latest set of hacks and patches on top of drm-intel-nightly: http://cgit.freedesktop.org/~ickle/linux-2.6/commit/?h=bug76554
Created attachment 97744 [details] dmesg with HEAD==218bb0e7f The problem is still there with the referenced branch (SHA1 HEAD 218bb0e7f). Dmesg attached.
So it passes the immediate check that HEAD is valid after setting CTRL, but then fails shortly afterwards. Humph. I am not sure what is going on! I wonder if it is as simple as the combination of reads failing?
The problematic condition causing the whole ring to be declared dead is
(I915_READ_HEAD(ring) & HEAD_ADDR) == 8
right? I915_READ_HEAD(ring) returns 000e200c and HEAD_ADDR is 0x001FFFFC, so the result is e200c, not the expected value of 8, causing the ring initialization failure. Or am I completely wrong here?
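The arithmetic in this comment can be checked with a few lines of C. This is a standalone sketch, not driver code: `head_at_expected` is a hypothetical helper, and the mask value is the one quoted in the comment.

```c
#include <stdbool.h>
#include <stdint.h>

#define HEAD_ADDR 0x001FFFFCu	/* mask value quoted in the comment */

/* Mirrors the last clause of the wait_for() condition: after two MI_NOOPs
 * (8 bytes) have been queued, the masked HEAD is expected to equal 8. */
static bool head_at_expected(uint32_t head_reg, uint32_t expected)
{
	return (head_reg & HEAD_ADDR) == expected;
}
```

For the logged value, `0x000e200c & HEAD_ADDR` is 0xe200c, so `head_at_expected(0x000e200c, 8)` is false, which agrees with the reading given in the comment.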
(In reply to comment #50)
I probably wasn't super clear about what I was referring to in this comment:
> The problematic condition causing the whole ring to be claimed dead is
>
> (I915_READ_HEAD(ring) & HEAD_ADDR) == 8
>
> right?
>
> I915_READ_HEAD(ring) returns 000e200c HEAD_ADDR is 0x001FFFFC, so the result
> is e200c, not the expected value of 8, causing the ring initialization
> failure.
I was referring to this:
(In reply to comment #49)
> So it passes the immediate check that HEAD is valid after setting CTRL, but
> then fails shortly afterwards. Humph. I am not sure what is going on!
because I don't see any check for HEAD validity after setting CTRL; I only see the I915_READ_START(ring) != i915_gem_obj_ggtt_offset(obj) check, but no I915_READ_HEAD() check ... but obviously, I am absolutely unfamiliar with this code, so sorry for likely creating unnecessary noise.
> I wonder if it is as simple as the combination of reads failing?
No, it is just me getting confused between HEAD and START. Ok, I wonder if this is the missing piece of magic (on top of the current bug branch):
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index b46b3e928a7f..12c59e945f8e 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -530,15 +530,14 @@ static int init_ring_common(struct intel_ring_buffer *ring)
 			((ring->size - PAGE_SIZE) & RING_NR_PAGES)
 			| RING_VALID);
 
-	if (I915_READ_START(ring) != i915_gem_obj_ggtt_offset(obj)) {
+	if (I915_READ_HEAD(ring)) {
 		printk(KERN_ERR
 		       "%s initialization failed"
-		       " [%08x != %08x], fudging\n",
+		       " [head now %08x], fudging\n",
 		       ring->name,
-		       I915_READ_START(ring),
-		       i915_gem_obj_ggtt_offset(obj));
-		I915_WRITE_START(ring, i915_gem_obj_ggtt_offset(obj));
-		POSTING_READ(ring);
+		       I915_READ_HEAD(ring));
+		I915_WRITE_HEAD(ring, 0);
+		(void)I915_READ_HEAD(ring);
 	}
 
 	iowrite32(MI_NOOP, ring->virtual_start + 0);
Created attachment 97747 [details] dmesg with fixed start/head On the first resume, the issue didn't occur, but second suspend-resume cycle revealed it again. dmesg attached.
After the first resume, we applied the fixup. After the second resume, it managed to get past the check and then failed. /o\
Created attachment 97756 [details] [review] Retry ring initialisation And another hack!
Created attachment 97774 [details] dmesg with retry-patch applied dmesg with the patch from comment#55 applied on top of the previous pile. The only notable difference seems to be the appearance of
[drm:stop_ring] *ERROR* render ring :timed out trying to stop ring
during resume.
Is there anything new on this front, please?
I haven't had any other inspiration. Maybe,
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 401f3e7..ccb0e5c 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -513,6 +513,7 @@ reset:
 	 * registers with the above sequence (the readback of the HEAD registers
 	 * also enforces ordering), otherwise the hw might lose the new ring
 	 * register values. */
+	memset(ring->virtual_start, 0, ring->size);
 	I915_WRITE_START(ring, i915_gem_obj_ggtt_offset(obj));
 	I915_WRITE_CTL(ring,
 			((ring->size - PAGE_SIZE) & RING_NR_PAGES)
With that patch in place (on top of all previous patches), this is still in dmesg upon resume:
[ 30.584016] [drm:stop_ring] *ERROR* render ring :timed out trying to stop ring
[ 30.584021] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f401 (valid? 1) head 000e202c tail 00000008 start 000e4000 [expected 000e4000]
[ 30.584024] Ring render ring after initialisation: 0001f401 000e202c 00000008 000e4000
[ 30.584034] [drm:__i915_drm_thaw] *ERROR* failed to re-initialize GPU, declaring wedged!
No good ideas here either, but it would be nice to see if this makes a difference on ring init:
diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 4024e16..708a1da 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -573,6 +573,15 @@ static int i915_drm_thaw_early(struct drm_device *dev)
 static int __i915_drm_thaw(struct drm_device *dev, bool restore_gtt_mappings)
 {
 	struct drm_i915_private *dev_priv = dev->dev_private;
+	int ret;
+
+	mutex_lock(&dev->struct_mutex);
+	ret = intel_gpu_reset(dev);
+	mutex_unlock(&dev->struct_mutex);
+
+	if (ret)
+		DRM_ERROR("failed to reset the GPU on resume (%d), ignoring\n",
+			  ret);
(In reply to comment #60)
> No good ideas here either, but would be nice to see if this makes a
> difference on
> ring init:
Even with the memset() patch from comment#60 applied on top of the previous bunch, I see this on resume:
[ 54.300012] [drm:stop_ring] *ERROR* render ring :timed out trying to stop ring
[ 54.300018] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f401 (valid? 1) head 000e252c tail 00000008 start 000e4000 [expected 000e4000]
[ 54.300021] Ring render ring after initialisation: 0001f401 000e252c 00000008 000e4000
[ 54.300031] [drm:__i915_drm_thaw] *ERROR* failed to re-initialize GPU, declaring wedged!
(In reply to comment #61)
> (In reply to comment #60)
> > No good ideas here either, but would be nice to see if this makes a
> > difference on
> > ring init:
>
> Even with the memset()
memset() here should actually read intel_gpu_reset(), sorry for the confusion.
Assigning to Chris since he seems to be all over it.
Created attachment 99172 [details] [review] Prevent updating the HWS whilst it is active Stumbled across this. Probably irrelevant, but it is in the right area.
*** Bug 77977 has been marked as a duplicate of this bug. ***
(In reply to comment #64) > Created attachment 99172 [details] [review] [review] > Prevent updating the HWS whilst it is active > > Stumbled across this. Probably irrelevant, but it is in the right area. Unfortunately this patch doesn't improve the behavior.
A change somewhere between 3.14 and 3.15 makes me hit this bug *almost* reliably. Bisecting it took me half a day and ended up pointing at commit [78f2975eec9faff353a6194e854d3d39907bab68] drm/i915: Move all ring resets before setting the HWS page. As the title is the same as that of a patch posted here earlier, I suppose it is the exact same patch? It seems that what was meant to be a solution to the problem actually makes it much worse (and maybe helps to find its root cause). If there's anything else I can do, just let me know.
It looks like I have the same problem. After upgrading the kernel to 3.15 / 3.15.1, the following error appears in dmesg after suspend:
[ 31.496713] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 000009c0 tail 00000000 start 000fd000
[ 31.591596] PM: Device 0000:00:02.0 failed to resume async: error -5
I have G45 - X4500MHD, mesa 10.2.1, xf86-video-intel 2.99.912, libdrm 2.4.54.
Fwiw, http://cgit.freedesktop.org/~ickle/linux-2.6/log/?h=bug76554 has everything we have tried so far.
Here's something you can try on top of that branch:
diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 9ee4ab306134..4f3397f87152 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -539,13 +539,13 @@ reset:
 			goto reset;
 
 		DRM_ERROR("%s initialization failed "
-			  "ctl %08x (valid? %d) head %08x tail %08x start %08x [expected %08lx]\n",
+			  "ctl %08x (valid? %d) head %08x tail %08x start %08x [expected %08lx], fudging\n",
 			  ring->name,
 			  I915_READ_CTL(ring), I915_READ_CTL(ring) & RING_VALID,
 			  I915_READ_HEAD(ring), I915_READ_TAIL(ring),
 			  I915_READ_START(ring), (unsigned long)i915_gem_obj_ggtt_offset(obj));
-		ret = -EIO;
-		goto out;
+
+		ring->write_tail(ring, I915_READ_HEAD(ring) & HEAD_ADDR);
 	}
 
 	if (!drm_core_check_feature(ring->dev, DRIVER_MODESET))
The idea is to ignore the failure and see if we can program the GPU anyway.
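One way to read the hack, offered as an interpretation rather than the author's stated reasoning: the patch sets TAIL to wherever HEAD currently points, and a ring whose two pointers agree has nothing queued. A simplified sketch of that bookkeeping, assuming a power-of-two ring size and ignoring any reserved gap a real driver keeps between the pointers; `ring_pending` is a hypothetical helper.

```c
#include <stdint.h>

/* Bytes queued between head and tail in a ring of `size` bytes, where
 * size is a power of two. Unsigned wraparound handles the case where
 * tail has wrapped past the end of the buffer and head has not. */
static uint32_t ring_pending(uint32_t head, uint32_t tail, uint32_t size)
{
	return (tail - head) & (size - 1);
}
```

With TAIL equal to HEAD the pending count is zero, so under this reading the hardware would be left idle at its reported position instead of the driver giving up with -EIO.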
(In reply to comment #70)
> Here's something you can try on top of that branch:
[ ... snip ... ]
> The idea is to ignore the failure and see if we can program the GPU anyway.
This made things much worse. X comes back after resume (i.e. the windows are drawn exactly the same way they were laid out at suspend time), but afterwards the system is completely dead. Even ctrl-alt-backspace doesn't kill the X session, and it's not possible to switch to a text console.
Hi, I am currently on 3.16-rc4 and the GPU gets disabled right on load of the i915 module, no need to suspend/wake :-( Xorg seems to draw fine, although unaccelerated, and xv is not working either (as one would expect). The system is a Lenovo T500 with a GM45 chipset. If I can somehow help debug this by providing logs, let me know...
Hi, I'm hitting this reliably on every resume on 3.15.5 (libdrm 2.4.54, mesa 10.2.3). Relevant dmesg output:
juil. 11 17:28:23 Nemmerle kernel: [drm:i965_irq_handler] *ERROR* pipe B underrun
juil. 11 17:29:52 Nemmerle kernel: [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 head 00000268 tail 00000000 start 007510
juil. 11 17:29:52 Nemmerle kernel: dpm_run_callback(): pci_pm_resume+0x0/0xb0 returns -5
juil. 11 17:29:52 Nemmerle kernel: PM: Device 0000:00:02.0 failed to resume async: error -5
Kwin crashes really badly after this happens... Not sure if the pipe B underrun is related, as that happens even without resuming. This is on:
00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 07)
If I can help in any way, let me know...
For me the problem occurs when my Lenovo G550 runs KDE4 with desktop effects on and on battery 00:02.0 VGA compatible controller: Intel Corporation Mobile 4 Series Chipset Integrated Graphics Controller (rev 09)
It seems that the bug is fixed in kernel 3.16. For me, it works normally again.

Archlinux
xorg-server 1.16.0-5
xf86-video-intel 2.99.914-1
libdrm 2.4.56-1
mesa 10.2.4-1
linux 3.16-1
Still present on 3.16 -- as expected. Guess the proposed experiments for this go into the 3.17 merge window?

[ 0.536244] [drm] Initialized drm 1.1.0 20060810
[ 0.536720] [drm] Memory usable by graphics device = 2048M
[ 0.536783] [drm] Replacing VGA console driver
[ 0.537337] Console: switching to colour dummy device 80x25
[ 0.543156] i915 0000:00:02.0: irq 44 for MSI/MSI-X
[ 0.543168] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[ 0.543175] [drm] Driver supports precise vblank timestamp query.
[ 0.543255] vgaarb: device changed decodes: PCI:0000:00:02.0,olddecodes=io+mem,decodes=io+mem:owns=io+mem
[ 0.619102] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 (valid? 1) head 00000298 tail 00000000 start 000fd000 [expected 000fd000]
[ 0.619114] [drm:i915_gem_init] *ERROR* Failed to initialize GPU, declaring it wedged
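For readers trying to interpret the register values in these error lines, the following is a minimal sketch of the three conditions that, as far as I understand the 3.16-era init_ring_common(), must all hold for the ring to count as initialized. The mask values are quoted from memory of i915_reg.h and should be treated as assumptions. The enum, classify_ring_init(), ring_size_from_ctl() and check_boot_failure_example() are hypothetical names, not driver API.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed i915 register masks (from memory of i915_reg.h of that era). */
#define RING_VALID    0x00000001u /* enable bit in the ring CTL register */
#define RING_NR_PAGES 0x001FF000u /* (ring size - one page) field in CTL */
#define HEAD_ADDR     0x001FFFFCu /* offset bits of the HEAD register */
#define PAGE_SIZE_4K  4096u

enum ring_init_status {
    RING_INIT_OK,
    RING_INIT_NOT_VALID,      /* CTL does not have the valid bit set */
    RING_INIT_START_MISMATCH, /* START differs from the expected GGTT offset */
    RING_INIT_HEAD_NOT_ZERO   /* HEAD offset is non-zero right after init */
};

/* Decode the ring size from CTL, assuming it stores (size - PAGE_SIZE). */
uint32_t ring_size_from_ctl(uint32_t ctl)
{
    return (ctl & RING_NR_PAGES) + PAGE_SIZE_4K;
}

/* Report which of the three post-init conditions fails first, if any. */
enum ring_init_status classify_ring_init(uint32_t ctl, uint32_t head,
                                         uint32_t start, uint32_t expected_start)
{
    if ((ctl & RING_VALID) == 0)
        return RING_INIT_NOT_VALID;
    if (start != expected_start)
        return RING_INIT_START_MISMATCH;
    if ((head & HEAD_ADDR) != 0)
        return RING_INIT_HEAD_NOT_ZERO;
    return RING_INIT_OK;
}

/* Values from the boot failure above: ctl 0001f001, head 00000298, start 000fd000. */
void check_boot_failure_example(void)
{
    assert(ring_size_from_ctl(0x0001f001u) == 32u * PAGE_SIZE_4K);
    assert(classify_ring_init(0x0001f001u, 0x00000298u,
                              0x000fd000u, 0x000fd000u) == RING_INIT_HEAD_NOT_ZERO);
}
```

Under these assumptions the log line above decodes as: ring enabled, START programmed correctly, but HEAD already sitting at a non-zero offset, which matches the "valid? 1" and "[expected ...]" fields printed by the driver.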
3.16 still doesn't work for me; it fails in exactly the same way as before.
Created attachment 104181 [details] [review] frob ring_stop a bit I think this is something we haven't tried yet. dmesg with results highly welcome.
Sadly, the last patch does nothing for me.

simon@thinkpad:~$ dmesg | grep drm
[ 0.539701] [drm] Initialized drm 1.1.0 20060810
[ 0.540192] [drm] Memory usable by graphics device = 2048M
[ 0.540256] [drm] Replacing VGA console driver
[ 0.547166] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[ 0.547173] [drm] Driver supports precise vblank timestamp query.
[ 0.624101] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 (valid? 1) head 00000204 tail 00000000 start 000fd000 [expected 000fd000]
[ 0.624113] [drm:i915_gem_init] *ERROR* Failed to initialize GPU, declaring it wedged
[ 0.648508] fbcon: inteldrmfb (fb0) is primary device
[ 1.186237] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
[ 1.222687] [drm] Initialized i915 1.6.0 20080730 for 0000:00:02.0 on minor 0
Created attachment 104208 [details] [review] frob + debug ring head Ok, I'm honestly lost as to what's going on right now. Can you please retest with this patch, which has piles of debug output?
Created attachment 104212 [details] dmesg with patch from comment #80

Here is a dmesg from two suspend-resume cycles with Daniel's patch from comment #80 (applied on top of all previous patches from Chris). There seems to be a change in behavior! Interestingly, I am not seeing

[drm:init_ring_common] *ERROR* render ring initialization failed ....

messages any more, and it seems I am indeed still running accelerated (i.e. the graphics still seem reasonably fast and usable). For some reason there is still a

render ring initialization failed [head now 01000000], fudging

message present though. I will now try the patch from comment #80 on top of something more recent than the 3.15-based branch with all of Chris' patches applied, and will let you know.
Created attachment 104216 [details] dmesg of 3.16 + patch from comment #80

Woohooo! Daniel, your patch from comment #80 seems to make the bug go away! Attached is dmesg from 3.16 + patch from comment #80. No more ring initialization errors in dmesg, everything working properly. I'll do a couple more suspend/resume cycles to make sure that the problem hasn't merely become less frequent, and will report back.

So either it's the wait_for_atomic() -> wait_for() change, the extra wait_for() added in order to wait for the head to be cleared, or the extra I915_READ_HEAD() reads inserted. Unless you now have a clear idea of what's happening, I'll try to isolate which of the changes in the patch is the one that makes the difference.
Created attachment 104224 [details] [review] [PATCH] drm/i915: read HEAD register back in init_ring_common() to enforce ordering Ok, this is the minimal change that reliably makes my system behave properly again finally (woohoo!). Chris, Daniel, what do you think?
Okay, after 31 suspend-resume cycles, the problem appeared again (while without the patch, it triggers with 100% reliability) with patch from comment #83 applied. So it's not a complete fix, it just makes the problem much less likely to happen.
The intel_ring_setup_status_page() does a posting read anyway, so it is not an ordering issue. So this is back in the magic read territory, can you try with a msleep(10) instead of the read just to confirm that it is the read doing the trick and not an extra delay?
It's still very strange that HEAD starts to move once we've initialized the ring. So it seems like we can properly reset it, but then it goes bananas ... More dmesgs from different machines with that frob+debug patch definitely appreciated.
(In reply to comment #85)
> The intel_ring_setup_status_page() does a posting read anyway, so it is not
> an ordering issue. So this is back in the magic read territory, can you try
> with a msleep(10) instead of the read just to confirm that it is the read
> doing the trick and not an extra delay?

With msleep() used instead of the register read, the problem triggers fully reliably again; i.e. the "magic read" really does some trick, although it's not a complete cure.
(In reply to comment #86)
> It's still very strange that HEAD starts to move once we've initialized the
> ring. So it seems like we can properly reset it, but then it goes bananas ...
>
> More dmesgs from different machines with that frob+debug patch definitely
> appreciated.

I will provide you with dmesg output from a failing resume with your frob+debug patch once the issue triggers with it applied (it hasn't so far).
Created attachment 104226 [details] [review] dmesg with the frob+debug patch from comment #80 showing the issue Finally after many suspend-resume cycles, the issue triggered also with Daniel's frob+debug patch from comment #80. Resulting dmesg attached.
Jiri, can you please submit your patch from comment #83 upstream? It's not perfect, but duct tape is good, so I'll merge it as an interim solution.
(In reply to comment #90)
> Jiri, can you please submit your patch from comment #83 upstream? It's
> not perfect, but duct tape is good, so I'll merge it as an interim solution.

I can definitely do that if that's your preferred course of action. This will mean that a smaller number of people will hit the bug and hence be available to test proper fixes (hopefully the dmesg from comment #89 will provoke some idea?). OTOH, I will always be here to test any patches towards a final fix, so if that's enough for you, then fine :)
Created attachment 104227 [details] [review] head start before enabling Another crazy idea. Looking at logs and Jiri's patch, the critical step seems to be when we set the valid bit. Let's see what happens if we give the ring a headstart, hopefully catching the moving ring. You can experiment with different values, as long as they're a multiple of 8. 64 might be magic since it's the cacheline size (which in a few w/a is really important for register writes, even though that's strange).
(In reply to comment #92)
> Created attachment 104227 [details] [review] [review]
> head start before enabling
>
> Another crazy idea. Looking at logs and Jiri's patch, the critical step
> seems to be when we set the valid bit. Let's see what happens if we give the
> ring a headstart, hopefully catching the moving ring.
>
> You can experiment with different values, as long as they're a multiple of
> 8. 64 might be magic since it's the cacheline size (which in a few w/a is
> really important for register writes, even though that's strange).

This patch causes another ring initialization failure, 100% of the time, during boot (i.e. no suspend-resume cycle is even necessary):

[ 3.496122] [drm:init_ring_common] *ERROR* bsd ring initialization failed ctl 0001f001 (valid? 1) head 00000008 tail 00000040 start 00107000 [expected 00107000]
[ 3.496256] [drm:i915_gem_init] *ERROR* Failed to initialize GPU, declaring it wedged
One of the earlier patches is now available as a standalone module, http://patchwork.freedesktop.org/patch/31266/ as Ville found a suspiciously similar w/a for g4x.
No more failures on boot or resume here so far, using Jiri's one-liner. Will see if this is consistent. Thanks all! :-)
(In reply to comment #95)
> No more failures on boot or resume here so far, using Jiri's one-liner. Will
> see if this is consistent.
>
> Thanks all! :-)

Thanks for testing. Please bear in mind though that this is a workaround that makes the bug less likely to happen, but it's still possible that it triggers.

(In reply to comment #89)
> Created attachment 104226 [details] [review] [review]
> dmesg with the frob+debug patch from comment #80 showing the issue
>
> Finally after many suspend-resume cycles, the issue triggered also with
> Daniel's frob+debug patch from comment #80.
>
> Resulting dmesg attached.

Daniel, did that make any sense whatsoever to you? There is an obvious difference in the value of 'After init', 0x01000000 (working case) vs. 0x000e4004 (broken case), and nothing else stands out to me.
I'll have the affected notebook with me next week in Chicago at Kernel Summit, in case it'd help you with debugging ... ?
Hello. I'm getting this in dmesg:

[ 12.413399] [drm:i915_gem_init] *ERROR* Failed to initialize GPU, declaring it wedged

OpenGL is broken in my system, Xv is also broken. I can't watch any videos in mpv/mplayer/vlc with vo_xv. Also, glxgears returns this:

[diego@myhost school]$ glxgears
Running synchronized to the vertical refresh. The framerate should be
approximately the same as the monitor refresh rate.
intel_do_flush_locked failed: Invalid argument
[diego@myhost school]$

Is this the same bug or should I open another one?
Arch Linux (x86_64) here.
Same here:

Aug 19 08:59:09 localhost kernel: [ 7.606690] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 (valid? 1) head 00009654 tail 00000000 start 000e4000 [expected 000e4000]
Aug 19 08:59:09 localhost kernel: [ 7.606711] [drm:i915_gem_init] *ERROR* Failed to initialize GPU, declaring it wedged

Archlinux, AMD64, kernel 3.16.1
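Since several reports in this thread paste the same error line in two slightly different formats (with and without the "(valid? N)" and "[expected ...]" fields), a small parser can help when comparing values across machines. This is only a sketch based on the lines quoted here. struct ring_init_error, parse_ring_init_error() and check_parse_examples() are hypothetical names, and the format strings are inferred from the pasted logs, not taken from the driver source.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>
#include <string.h>

struct ring_init_error {
    uint32_t ctl, head, tail, start;
    uint32_t expected_start; /* only meaningful if has_expected != 0 */
    int has_expected;
};

/*
 * Parse a "ring initialization failed" dmesg line.
 * Returns 0 on success, -1 if the line does not match either known format
 * (in which case *out may have been partially written).
 */
int parse_ring_init_error(const char *line, struct ring_init_error *out)
{
    const char *msg = strstr(line, "initialization failed ctl ");
    int n;

    if (msg == NULL)
        return -1;

    /* Newer format: includes "(valid? N)" and "[expected XXXXXXXX]". */
    n = sscanf(msg,
               "initialization failed ctl %" SCNx32 " (valid? %*d) head %" SCNx32
               " tail %" SCNx32 " start %" SCNx32 " [expected %" SCNx32 "]",
               &out->ctl, &out->head, &out->tail, &out->start,
               &out->expected_start);
    if (n == 5) {
        out->has_expected = 1;
        return 0;
    }

    /* Older format: only ctl, head, tail and start. */
    n = sscanf(msg,
               "initialization failed ctl %" SCNx32 " head %" SCNx32
               " tail %" SCNx32 " start %" SCNx32,
               &out->ctl, &out->head, &out->tail, &out->start);
    if (n == 4) {
        out->has_expected = 0;
        out->expected_start = 0;
        return 0;
    }
    return -1;
}

/* Example using the line from the report directly above. */
void check_parse_examples(void)
{
    struct ring_init_error e;
    int ret = parse_ring_init_error(
        "[    7.606690] [drm:init_ring_common] *ERROR* render ring "
        "initialization failed ctl 0001f001 (valid? 1) head 00009654 "
        "tail 00000000 start 000e4000 [expected 000e4000]", &e);

    assert(ret == 0 && e.has_expected);
    assert(e.head == 0x9654u && e.start == e.expected_start);
}
```

Feeding the parsed values of several reports side by side makes the common pattern easier to spot: CTL and START look sane, while HEAD differs from machine to machine.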
Jiri Kosina's patch that more or less fixes this (at least for now and on my system...) is already in 3.17-rc1. So you could either patch your current version or upgrade. :-)
Will Linux 3.16.2 include this fix/workaround?
Fixed by

commit ece4a17d237a79f63fbfaf3f724a12b6d500555c
Author: Jiri Kosina <jkosina@suse.cz>
Date: Thu Aug 7 16:29:53 2014 +0200

    drm/i915: read HEAD register back in init_ring_common() to enforce ordering

(In reply to comment #102)
> Will Linux 3.16.2 include this fix/workaround?

It will eventually be backported to supported stable kernels, but when that happens depends on the stable team.
I suggest keeping the bug open for quite some time. We all know (and my stress-testing underlines that) that this is duct tape and not a real fix. I am still planning to spend some more time on this. If you hate having this bug assigned to Intel because you believe there's not much that you can do (which is my current understanding), please feel free to re-assign it to me. I'll close it either once I am completely out of crazy ideas, or once I find a reliable fix. Thanks.
(In reply to comment #104)
> I suggest keeping the bug open for quite some time.

Ok, dropping regression and reducing priority.
commit 95468892fdfeef6d1004b524e35957629efdbe00
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Thu Aug 7 15:39:54 2014 +0100

    drm/i915: Reset the HEAD pointer for the ring after writing START

Ville found an old w/a documented for g4x that suggested that we need to reset the HEAD after writing START. This is a useful fixup for some of the g4x ring initialisation woes, but as usual, not all.
I applied Jiri's patch (attachment 104224 [details] [review]) against a vanilla 3.16.3 and I can confirm that it mitigates the issue. I see no more "render ring initialization failed" errors, but there seem to be side effects, at least on my Thinkpad T400 (GM45, Debian stable). I noticed that mouse movement becomes sluggish (without applications open; I didn't find the reason) and, as a showstopper, my whole system froze. I couldn't get out of X with ctrl-alt-backspace, and when I tried to switch to tty1 I briefly saw a stack dump before the screen went entirely black (not off). Would you like a dmesg or a drm debug log? If there are any other patches available, please tell me.
Created attachment 106514 [details] dmesg 3.16.2 Another dmesg. Kernel 3.16.2 (chakra, default).
I may have been bugged by this issue since the 3.15/3.16 kernels, since:

---
From 78f2975eec9faff353a6194e854d3d39907bab68 Mon Sep 17 00:00:00 2001
From: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed, 2 Apr 2014 16:36:07 +0100
Subject: drm/i915: Move all ring resets before setting the HWS page

In commit a51435a3137ad8ae75c288c39bd2d8b2696bae8f
Author: Naresh Kumar Kachhi <naresh.kumar.kachhi@intel.com>
Date: Wed Mar 12 16:39:40 2014 +0530

    drm/i915: disable rings before HW status page setup

we reordered stopping the rings to do so before we set the HWS register. However, there is an extra workaround for g45 to reset the rings twice, and for consistency we should apply that workaround before setting the HWS to be sure that the rings are truly stopped.
---

I continuously got things like the following after SUSPEND-TO-DISK:

[53556.636015] [drm:init_ring_common] *ERROR* render ring initialization failed ctl 0001f001 (valid? 1) head ffff8814 tail 00000000 start 0013c000 [expected 0013c000]
[53556.636018] [drm:__i915_drm_thaw] *ERROR* failed to re-initialize GPU, declaring wedged!

After which all Xv video playback fails (blank == black smplayer/vlc content window). This one was with 3.16.5, which already includes Jiri's patch (in the kernel since vanilla 3.16.4).

---
The only way to fix this issue for me, since I felt affected at 3.15.7, was at first to revert commit 78f2975eec9faff353a6194e854d3d39907bab68 (see: https://bitbucket.org/alfredchen/linux-gc/commits/b462d396377b2765d1ed3b416bcf90dc02a85293/raw/) and then to keep following its children as the kernel evolved, reverting them too. Until now. I'll attach this _working_ patch for 3.16.5 in the following comment, in the hope of bringing some inspiration to you all, although it is already some kind of history?!

If you need more info, please just ask for it.

Best regards,
Manuel Krause
Created attachment 107730 [details] Extended-Revert-drm-i915-Move-all-ring-resets-before-setting-the-HWS-page.patch For kernels from 3.16.4 and upwards you'd need -- at first -- to REVERT commit ece4a17d237a79f63fbfaf3f724a12b6d500555c BEFORE applying this.
I wanted to play with this a little bit more, but even after reverting the band-aid implemented in ece4a17d23, I haven't been able to reproduce the problem so far with 3.18-rc1+ (HEAD == c2661b8060) after 40 suspend-resume cycles (which is a timeframe when the problem usually triggered in the past). I'll keep trying to reproduce the problem, but it'd be nice if others who have been able to reproduce the problem would be able to try with current Linus' tree (3.18-rc1 and later) with ece4a17d23 workaround reverted, and report their findings. Thanks.
Unfortunately 3.18.0-rc1 appears to be highly buggy/unstable on my machine. I'll come back later with my re-testing when this is fixed. But I definitely will.
*** Bug 86067 has been marked as a duplicate of this bug. ***
(In reply to Manuel Krause from comment #112)
> Unfortunately 3.18.0-rc1 appears to be highly buggy/unstable on my
> machine. I'll come back later with my re-testing when this is fixed. But
> I definitely will.

We're later in -rc ... any updates on the state of ring init on g4x?
(In reply to Daniel Vetter from comment #114)
> (In reply to Manuel Krause from comment #112)
> > Unfortunately 3.18.0-rc1 appears to be highly buggy/unstable on my
> > machine. I'll come back later with my re-testing when this is fixed. But
> > I definitely will.
>
> We're later in -rc ... any updates on the state of ring init on g4x?

I'm still being hit by another BUG that prevents me from testing this one under real world conditions. With 3.18-rc5 I get the following already during boot, after which no Xv video playback is possible anymore, while glxgears works:

from dmesg:
[drm:intel_pipe_config_compare] *ERROR* mismatch in pipe_src_w (expected 0, found 4096)
[ 36.653126] ------------[ cut here ]------------
[ 36.653211] WARNING: CPU: 0 PID: 712 at drivers/gpu/drm/i915/intel_display.c:10966 check_crtc_state+0x7b3/0x1010 [i915]()
[ 36.653215] pipe state doesn't match!
[ 36.653218] Modules linked in: nf_log_ipv6 xt_pkttype nf_log_ipv4 nf_log_common xt_LOG xt_limit pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) ip6t_REJECT nf_reject_ipv6 xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT nf_reject_ipv4 iptable_raw xt_CT iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables vboxdrv(O) xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables fuse snd_hda_codec_hdmi snd_hda_codec_analog snd_hda_codec_generic coretemp kvm_intel kvm hp_wmi sparse_keymap rfkill iTCO_wdt iTCO_vendor_support snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_seq microcode snd_timer snd_seq_device joydev serio_raw snd lpc_ich mfd_core tg3 libphy ptp pps_core soundcore wmi battery tpm_infineon
[ 36.653305] tpm_tis tpm hp_accel lis3lv02d input_polldev ac evdev acpi_cpufreq sg loop dm_mod ipv6 autofs4 btrfs raid6_pq xor i915 drm_kms_helper drm video i2c_algo_bit button
[ 36.653339] CPU: 0 PID: 712 Comm: Xorg Tainted: G O 3.18.0-rc5-vanilla #1
[ 36.653344] Hardware name: Hewlett-Packard HP Compaq 6730b (KU489ET#ABD)/30DD, BIOS 68PDD Ver. F.20 12/07/2011
[ 36.653348] 0000000000000009 ffff8800b7c13848 ffffffff814bcd76 0000000000000000
[ 36.653356] ffff8800b7c13898 ffff8800b7c13888 ffffffff8103c347 ffff8800b7c13890
[ 36.653362] ffff880139da4060 ffff880139057000 ffff880139130000 ffff880139057330
[ 36.653370] Call Trace:
[ 36.653386] [<ffffffff814bcd76>] dump_stack+0x4e/0x71
[ 36.653395] [<ffffffff8103c347>] warn_slowpath_common+0x77/0xa0
[ 36.653402] [<ffffffff8103c3b1>] warn_slowpath_fmt+0x41/0x50
[ 36.653463] [<ffffffffa012ac91>] ? intel_lvds_get_config+0x41/0xe0 [i915]
[ 36.653515] [<ffffffffa00f5e73>] check_crtc_state+0x7b3/0x1010 [i915]
[ 36.653525] [<ffffffff81063638>] ? dequeue_task_fair+0x368/0x4b0
[ 36.653581] [<ffffffffa010592f>] intel_modeset_check_state+0x27f/0x790 [i915]
[ 36.653634] [<ffffffffa0105ed0>] intel_set_mode+0x20/0x30 [i915]
[ 36.653687] [<ffffffffa0106e7c>] intel_crtc_set_config+0x91c/0xe40 [i915]
[ 36.653732] [<ffffffffa002af01>] drm_mode_set_config_internal+0x61/0xf0 [drm]
[ 36.653770] [<ffffffffa002f424>] drm_mode_setcrtc+0xd4/0x590 [drm]
[ 36.653800] [<ffffffffa00217ac>] drm_ioctl+0x19c/0x630 [drm]
[ 36.653814] [<ffffffff8111ee80>] do_vfs_ioctl+0x2e0/0x4c0
[ 36.653822] [<ffffffff8111f0e1>] SyS_ioctl+0x81/0xa0
[ 36.653831] [<ffffffff814c2bd6>] system_call_fastpath+0x16/0x1b
[ 36.653837] ---[ end trace 41cc44d460ddfb2f ]---
[ 36.655549] [drm:intel_pipe_config_compare] *ERROR* mismatch in pipe_src_w (expected 0, found 4096)
[ 36.655556] ------------[ cut here ]------------
[ 36.655618] WARNING: CPU: 0 PID: 712 at drivers/gpu/drm/i915/intel_display.c:10966 check_crtc_state+0x7b3/0x1010 [i915]()
[ 36.655622] pipe state doesn't match!
[ 36.655625] Modules linked in: nf_log_ipv6 xt_pkttype nf_log_ipv4 nf_log_common xt_LOG xt_limit pci_stub vboxpci(O) vboxnetadp(O) vboxnetflt(O) ip6t_REJECT nf_reject_ipv6 xt_tcpudp nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_raw ipt_REJECT nf_reject_ipv4 iptable_raw xt_CT iptable_filter ip6table_mangle nf_conntrack_netbios_ns nf_conntrack_broadcast nf_conntrack_ipv4 nf_defrag_ipv4 ip_tables vboxdrv(O) xt_conntrack nf_conntrack ip6table_filter ip6_tables x_tables fuse snd_hda_codec_hdmi snd_hda_codec_analog snd_hda_codec_generic coretemp kvm_intel kvm hp_wmi sparse_keymap rfkill iTCO_wdt iTCO_vendor_support snd_hda_intel snd_hda_controller snd_hda_codec snd_hwdep snd_pcm snd_seq microcode snd_timer snd_seq_device joydev serio_raw snd lpc_ich mfd_core tg3 libphy ptp pps_core soundcore wmi battery tpm_infineon
[ 36.655709] tpm_tis tpm hp_accel lis3lv02d input_polldev ac evdev acpi_cpufreq sg loop dm_mod ipv6 autofs4 btrfs raid6_pq xor i915 drm_kms_helper drm video i2c_algo_bit button
[ 36.655740] CPU: 0 PID: 712 Comm: Xorg Tainted: G W O 3.18.0-rc5-vanilla #1
[ 36.655745] Hardware name: Hewlett-Packard HP Compaq 6730b (KU489ET#ABD)/30DD, BIOS 68PDD Ver. F.20 12/07/2011
[ 36.655749] 0000000000000009 ffff8800b7c13848 ffffffff814bcd76 0000000000000000
[ 36.655756] ffff8800b7c13898 ffff8800b7c13888 ffffffff8103c347 ffff8800b7c13890
[ 36.655763] ffff880139da4060 ffff880139057000 ffff880139130000 ffff880139057330
[ 36.655770] Call Trace:
[ 36.655782] [<ffffffff814bcd76>] dump_stack+0x4e/0x71
[ 36.655790] [<ffffffff8103c347>] warn_slowpath_common+0x77/0xa0
[ 36.655797] [<ffffffff8103c3b1>] warn_slowpath_fmt+0x41/0x50
[ 36.655860] [<ffffffffa012ac91>] ? intel_lvds_get_config+0x41/0xe0 [i915]
[ 36.655913] [<ffffffffa00f5e73>] check_crtc_state+0x7b3/0x1010 [i915]
[ 36.655971] [<ffffffffa010592f>] intel_modeset_check_state+0x27f/0x790 [i915]
[ 36.656010] [<ffffffffa0105ed0>] intel_set_mode+0x20/0x30 [i915]
[ 36.656114] [<ffffffffa0106cc1>] intel_crtc_set_config+0x761/0xe40 [i915]
[ 36.656156] [<ffffffffa002af01>] drm_mode_set_config_internal+0x61/0xf0 [drm]
[ 36.656194] [<ffffffffa002f424>] drm_mode_setcrtc+0xd4/0x590 [drm]
[ 36.656224] [<ffffffffa00217ac>] drm_ioctl+0x19c/0x630 [drm]
[ 36.656239] [<ffffffff8111ee80>] do_vfs_ioctl+0x2e0/0x4c0
[ 36.656247] [<ffffffff8111f0e1>] SyS_ioctl+0x81/0xa0
[ 36.656256] [<ffffffff814c2bd6>] system_call_fastpath+0x16/0x1b
[ 36.656261] ---[ end trace 41cc44d460ddfb30 ]---
[ 36.691044] [drm:i9xx_set_fifo_underrun_reporting] *ERROR* pipe A underrun
[ 37.153072] [drm:i9xx_set_fifo_underrun_reporting] *ERROR* pipe A underrun
[ 37.153084] [drm:i965_irq_handler] *ERROR* pipe A underrun

from Xorg.0.log:
[ 36.043] (II) AIGLX: Loaded and initialized i965
[ 36.043] (II) GLX: Initialized DRI2 GL provider for screen 0
[ 36.067] (WW) intel(0): Failed to submit rendering commands, trying again with outputs disabled.
[ 36.655] (EE) intel(0): unable to attach scanout
[ 36.655] (EE) intel(0): Failed to submit rendering commands, disabling acceleration.

This may have nothing to do with this bug, but I'd be glad if one of you could point me to an existing related bug report (Google didn't help) or even a solution.

Normally I'm now using a 3.17.4 kernel (without my reverting patch) -- and I still experience this bug regularly (approx. twice in three reboots; if hibernation worked correctly once, it keeps working), but it stays unpredictable. For a long time I thought that it also depends on the userspace Xorg/intel parts, but I'm not sure about that anymore, as I couldn't prove it over several months of regularly updating these parts of my openSUSE.
Best regards, I hope my info helps a bit, Manuel Krause
(In reply to Manuel Krause from comment #115)
> from Xorg.0.log:
> [ 36.043] (II) AIGLX: Loaded and initialized i965
> [ 36.043] (II) GLX: Initialized DRI2 GL provider for screen 0
> [ 36.067] (WW) intel(0): Failed to submit rendering commands, trying
> again with outputs disabled.
> [ 36.655] (EE) intel(0): unable to attach scanout
> [ 36.655] (EE) intel(0): Failed to submit rendering commands, disabling
> acceleration.

That's PIN_BIAS fallout.
(In reply to Chris Wilson from comment #116)
> (In reply to Manuel Krause from comment #115)
> > from Xorg.0.log:
> > [ 36.043] (II) AIGLX: Loaded and initialized i965
> > [ 36.043] (II) GLX: Initialized DRI2 GL provider for screen 0
> > [ 36.067] (WW) intel(0): Failed to submit rendering commands, trying
> > again with outputs disabled.
> > [ 36.655] (EE) intel(0): unable to attach scanout
> > [ 36.655] (EE) intel(0): Failed to submit rendering commands, disabling
> > acceleration.
>
> That's PIN_BIAS fallout.

I don't know what to do with this information. :-(
Kernel 3.18-rc6 has the same issue -- is there something I can do against it?
(In reply to Manuel Krause from comment #117)
> (In reply to Chris Wilson from comment #116)
> > (In reply to Manuel Krause from comment #115)
> > > from Xorg.0.log:
> > > [ 36.043] (II) AIGLX: Loaded and initialized i965
> > > [ 36.043] (II) GLX: Initialized DRI2 GL provider for screen 0
> > > [ 36.067] (WW) intel(0): Failed to submit rendering commands, trying
> > > again with outputs disabled.
> > > [ 36.655] (EE) intel(0): unable to attach scanout
> > > [ 36.655] (EE) intel(0): Failed to submit rendering commands, disabling
> > > acceleration.
> >
> > That's PIN_BIAS fallout.
>
> I don't know what to do with this information. :-(
> Kernel 3.18-rc6 has the same issue -- is there something I can do against it
> ?

http://www.spinics.net/lists/stable/msg71063.html should address this one - it's trickling through the queues (probably with a detour through 3.19).
(In reply to Daniel Vetter from comment #118)
> (In reply to Manuel Krause from comment #117)
> > (In reply to Chris Wilson from comment #116)
> > > (In reply to Manuel Krause from comment #115)
> > > > from Xorg.0.log:
> > > > [ 36.043] (II) AIGLX: Loaded and initialized i965
> > > > [ 36.043] (II) GLX: Initialized DRI2 GL provider for screen 0
> > > > [ 36.067] (WW) intel(0): Failed to submit rendering commands, trying
> > > > again with outputs disabled.
> > > > [ 36.655] (EE) intel(0): unable to attach scanout
> > > > [ 36.655] (EE) intel(0): Failed to submit rendering commands, disabling
> > > > acceleration.
> > >
> > > That's PIN_BIAS fallout.
> >
> > I don't know what to do with this information. :-(
> > Kernel 3.18-rc6 has the same issue -- is there something I can do against it
> > ?
>
> http://www.spinics.net/lists/stable/msg71063.html should address this one -
> it's trickling through the queues (probably with a detour through 3.19).

Thank you very very much!! This patch fixes the issue mentioned above.

So, if I see it correctly, the current testing on this bug is suspending/resuming on 3.18-rc with the ece4a17d23 workaround reverted, to see if the symptoms reoccur at all? How many iterations are considered sensible or sufficient?
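On the question of how many iterations are enough: for an intermittent failure, a rough way to reason about it is to assume that each suspend/resume cycle fails independently with some probability p. That independence is an assumption made purely for illustration (the thread suggests the failure may depend on machine state), and prob_all_clean(), cycles_needed() and check_cycle_examples() are hypothetical helpers. The sketch computes how likely a run of clean cycles is by pure chance, and how many clean cycles are needed before a bug of a given rate would very probably have shown up.

```c
#include <assert.h>

/*
 * Probability of observing `cycles` clean suspend/resume cycles in a row
 * when each cycle independently fails with probability p_fail.
 */
double prob_all_clean(double p_fail, unsigned cycles)
{
    double prob = 1.0;

    assert(p_fail >= 0.0 && p_fail <= 1.0);
    while (cycles--)
        prob *= 1.0 - p_fail;
    return prob;
}

/*
 * Smallest number of cycles after which a bug with per-cycle failure
 * probability p_fail would have triggered at least once with probability
 * >= confidence.
 */
unsigned cycles_needed(double p_fail, double confidence)
{
    unsigned n = 0;
    double prob_clean = 1.0;

    assert(p_fail > 0.0 && p_fail <= 1.0);
    assert(confidence > 0.0 && confidence < 1.0);
    while (1.0 - prob_clean < confidence) {
        prob_clean *= 1.0 - p_fail;
        n++;
    }
    return n;
}

/* Example: a failure rate of one in 31 cycles, as seen with the workaround. */
void check_cycle_examples(void)
{
    double p = 1.0 / 31.0;

    /* 40 clean cycles would still happen by chance a bit over a quarter of the time. */
    assert(prob_all_clean(p, 40) > 0.26 && prob_all_clean(p, 40) < 0.28);
    /* Around 92 clean cycles are needed for 95% confidence at this rate. */
    assert(cycles_needed(p, 0.95) == 92);
}
```

Under these assumptions, 40 clean cycles say little about a bug that fires once in 31 cycles, but they are strong evidence against one that previously triggered on nearly every resume, which is the situation described for kernels without the workaround.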
I hope that the patch cited in Comment 118 will get included as soon as possible, as it enabled my testing (or, even more, would enable me to use 3.18?).

For 3.18.0-rc6 with the workaround I've done:
5 hibernates+resumes / reboot same kernel /
5 hibernates+resumes / reboot same kernel /
1 hibernation+resume:
No issues.

For 3.18.0-rc6 WITHOUT the workaround I've done:
5 hibernates+resumes / reboot to my 3.17.4 everyday kernel / hibernate+resume / reboot to 3.18.0-rc6 WITHOUT the workaround, again, and this row done 4 times.
No issues.

I know these tests don't go up to Jiri's 40 suspend/resume count, but mine have been done under real world conditions (loading applications, playing video, etc.).

Can one of you explain what makes 3.18 so different from 3.17 in this case? It would be nice to see the related improvements backported to 3.17.

Thank you in advance,
Manuel
(In reply to Manuel Krause from comment #120)
> I hope, that the patch cited in Comment 118 will get included as soon as
> possible, as this enabled my testing (or even more, would enable me using
> 3.18?).

That's in v3.18.4.
The ring init appears to have been fixed in v3.18; at least there have been no further reports since we merged the w/a patch.