Created attachment 63841 [details] gem_dummy_reloc_loop dmesg log

System Environment:
--------------------------
Platform: Sandybridge
Kernel: (drm-intel-next-queued)dab92e08895537c7260c811f8e93cbf749f0b727
Some additional commit info:
Author: Eugeni Dodonov <eugeni.dodonov@intel.com>
Date: Mon Jul 2 11:51:08 2012 -0300

    drm/i915: disable RC6 when disabling rps

Bug detailed description:
-------------------------
"Aborted (core dumped)" is shown when running the gem_dummy_reloc_loop and gem_ring_sync_loop tests. The dmesg logs are attached.

[root@x-sgb4 tests]# ./gem_dummy_reloc_loop
running dummy loop on render
dummy loop run on render completed
running dummy loop on bsd
intel_batchbuffer_flush_on_ring:119 failed, ret=-5, errno=5
Aborted (core dumped)
[root@x-sgb4 tests]# ./gem_ring_sync_loop
intel_batchbuffer_flush_on_ring:119 failed, ret=-5, errno=5
Aborted (core dumped)
Created attachment 63842 [details] gem_ring_sync_loop dmesg log
We need the error state.
Also, does ./gem_dummy_reloc_loop reliably trigger the GPU hang (errno=5)?
That would seem to be true....

bsd command stream:
  HEAD: 0x0000000c
  TAIL: 0x00000050
  ACTHD: 0x0000000c
  IPEIR: 0x00000000
  IPEHR: 0x13040081
  INSTDONE: 0xffabffff
    busy: EU 12
    busy: EU 10
    busy: EU 02
  INSTPS: 0x00000000
  INSTPM: 0x00000000
  FADDR: 0x00023050
  FAULT_REG: 0x00000000
  SYNC_0: 0x00f4775d
  SYNC_1: 0x0104775f
  seqno: 0x00000000
  waiting: yes
  ring->head: 0x00000000
  ring->tail: 0x00000050

batchbuffer (bsd ring) at 0x04446000:
0x04446000: 0x13000001: MI_FLUSH_DW post_sync_op='no write'
0x04446004: 0x00000000: address
0x04446008: 0x04445000: dword
0x0444600c: 0x0040000f: MI_NOOP write NOPID reg, val=0xf
0x04446010: 0x00000000: MI_NOOP
0x04446014: 0x05000000: MI_BATCH_BUFFER_END

bsd ring --- 1 requests
  seqno 0x01047760, emitted 4303505930, tail 0x00000028

ringbuffer (bsd ring) at 0x00023000:
0x00023000: 0x13040081: MI_FLUSH_DW invalidate tlb, post_sync_op='no write' invalidate video state (BCS-only),
0x00023004: 0x00000000: address
0x00023008: 0x00000000: dword
0x0002300c: 0x00000000: MI_NOOP
0x00023010: 0x18800100: MI_BATCH_BUFFER_START
0x00023014: 0x04446000: dword 1
0x00023018: 0x13000001: MI_FLUSH_DW post_sync_op='no write'
0x0002301c: 0x00000000: address
0x00023020: 0x00000000: dword
0x00023024: 0x00000000: MI_NOOP
0x00023028: 0x0b640001: MI_SEMAPHORE_MBOX global gtt, update semaphore, use compare reg 0
0x0002302c: 0x01047760: value
0x00023030: 0x00002040: address
0x00023034: 0x0b640001: MI_SEMAPHORE_MBOX global gtt, update semaphore, use compare reg 0
0x00023038: 0x01047760: value
0x0002303c: 0x00022044: address
0x00023040: 0x10800001: MI_STORE_DATA_INDEX
0x00023044: 0x00000080: index
0x00023048: 0x01047760: dword
0x0002304c: 0x01000000: MI_USER_INTERRUPT

So dying on the very first FLUSH_DW of the BSD ring. I guess some of the recent changes (differing rc6 for media ring, any other chickens?) are culpable.
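For anyone reading the dump for the first time, here is a rough sketch of how the "first FLUSH_DW" conclusion can be cross-checked: IPEHR holds the header of the last command the ring parsed, so it can be matched against the ring contents. The helper names (MI_OPCODE, find_ipehr_in_ring, bsd_ring) are made up for illustration and are not driver code; the opcode position (bits 28:23) and the MI_FLUSH_DW opcode 0x26 are taken from the MI command encoding as I understand it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* MI commands carry their opcode in bits 28:23 of the header dword. */
#define MI_OPCODE(dword)    (((dword) >> 23) & 0x3f)
#define MI_FLUSH_DW_OPCODE  0x26

/*
 * Hypothetical helper: return the byte offset of the first ring dword
 * equal to ipehr, or -1 if no dword matches.
 */
static long find_ipehr_in_ring(const uint32_t *ring, size_t ndwords,
                               uint32_t ipehr)
{
    assert(ring != NULL);
    for (size_t i = 0; i < ndwords; i++)
        if (ring[i] == ipehr)
            return (long)(i * sizeof(uint32_t));
    return -1;
}

/* First eight dwords of the bsd ring, copied from the error state above. */
static const uint32_t bsd_ring[] = {
    0x13040081, 0x00000000, 0x00000000, 0x00000000,
    0x18800100, 0x04446000, 0x13000001, 0x00000000,
};
```

With IPEHR = 0x13040081 the match is at ring offset 0 and the opcode decodes to MI_FLUSH_DW, while HEAD/ACTHD = 0xc sits just past that three-dword command, which fits the reading that the ring never got beyond its first flush.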
Also does appear limited to SNB... So recent HW enablement seems most suspect.
Created attachment 63849 [details] errorstate_gem_dummy_reloc_loop
Created attachment 63850 [details] errorstate_gem_ring_sync_loop
990bbdadabaa51828e475eda86ee5720a4910cc3 is the first bad commit
commit 990bbdadabaa51828e475eda86ee5720a4910cc3
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Mon Jul 2 11:51:02 2012 -0300

    drm/i915: Group the GT routines together in both code and vtable

:(
diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 0e27e95..9dcb401 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -3978,6 +3978,7 @@ static void __gen6_gt_force_wake_get(struct drm_i915_private *dev_priv)
 		DRM_ERROR("Force wake wait timed out\n");
 
 	I915_WRITE_NOTRACE(FORCEWAKE, 1);
+	POSTING_READ(FORCEWAKE);
 
 	if (wait_for_atomic_us((I915_READ_NOTRACE(forcewake_ack) & 1), 500))
 		DRM_ERROR("Force wake wait timed out\n");
Or s/POSTING_READ/udelay(10)/
Created attachment 63862 [details] Trace of the BSD hang
So the minimum appears to be an ndelay(250). Nor does it appear to be a missing mb(), so it feels like the delay is king. I poked hard at the MI_FLUSH_DW on the BSD and the tail write w/a, and that doesn't seem to affect it. Nor does disabling rc6 or rps. And the delay has to be between the FORCEWAKE write and the read of the ACK. I HAVE NO CLUE. It could be an elephant or an iceberg...
Created attachment 63867 [details] [review] Kill the delay in wait_for_atomic()

And since I can't explain it, we may as well replace the udelay() whilst busy-waiting with a cpu_relax(). That way we should complete the atomic sections as fast as possible with roughly the same power efficiency as udelay().
An unexplained hang that seems dependent upon an unrelated timing makes me extremely nervous.
Tested with patch 64704 and patch 64706 from bug 52424. Both patches fix this bug.
Ok, I've smashed the two patches onto -fixes again, let's hope it holds. Please confirm that drm-intel-fixes really works now.
The issue still reproduces with this drm-intel-next-queued commit:
Kernel: (drm-intel-next-queued)65bccb5c708bd9f00d24f041f4f7c45130359448
Some additional commit info:
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Wed Aug 8 17:42:52 2012 +0200

    drm/i915: fixup desired rps frequency computation

But the tests pass with the drm-intel-fixes commit below:
Kernel: (drm-intel-fixes)0d8957c8a90bbb5d34fab9a304459448a5131e06
Some additional commit info:
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Tue Aug 7 09:54:14 2012 +0200

    drm/i915: correctly order the ring init sequence
Yeah, -next is not always a superset of -fixes, to avoid tons of spurious merges. I'm closing this as confirmed.
A patch referencing this bug report has been merged in Linux v3.6-rc2:
commit 6af2d180f82151cf3d58952e35a4f96e45bc453a
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date: Thu Jul 26 16:24:50 2012 +0200

    drm/i915: fix forcewake related hangs on snb
A patch referencing this bug report has been merged in Linux v3.6-rc2:
commit bcf9dcc1e6269fac674e41f25d007ff75f76e840
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Sun Jul 15 09:42:38 2012 +0100

    drm/i915: Workaround hang with BSD and forcewake on SandyBridge
A patch referencing a commit referencing this bug report has been merged in Linux v3.7-rc1:
commit 8dee3eea3ccd3b6c00a8d3a08dd715d6adf737dd
Author: Ben Widawsky <ben@bwidawsk.net>
Date: Sat Sep 1 22:59:50 2012 -0700

    drm/i915: Never read FORCEWAKE
A patch referencing a commit referencing this bug report has been merged in Linux v3.8-rc5:
commit b514407547890686572606c9dfa4b7f832db9958
Author: Jani Nikula <jani.nikula@intel.com>
Date: Thu Jan 17 10:24:09 2013 +0200

    drm/i915: fix FORCEWAKE posting reads
Closing old verified.