Bug 51738

Summary: [SNB regression?] Aborted (core dumped) showed during gem_dummy_reloc_loop/gem_ring_sync_loop test
Product: DRI Reporter: Du Yan <yanx.du>
Component: DRM/IntelAssignee: Daniel Vetter <daniel>
Status: CLOSED FIXED QA Contact:
Severity: critical    
Priority: high CC: ben, chris, daniel, florian, guang.a.yang, jbarnes, yi.sun
Version: unspecified   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: i915 features:
Attachments:
Description Flags
gem_dummy_reloc_loop dmesg log
none
gem_ring_sync_loop dmesg log
none
errorstate_gem_dummy_reloc_loop
none
errorstate_gem_ring_sync_loop
none
Trace of the BSD hang
none
Kill the delay in wait_for_atomic() none

Description Du Yan 2012-07-04 18:57:16 UTC
Created attachment 63841 [details]
gem_dummy_reloc_loop dmesg log

System Environment:
--------------------------
Platform: Sandybridge
Kernel: (drm-intel-next-queued)dab92e08895537c7260c811f8e93cbf749f0b727
Some additional commit info:
Author: Eugeni Dodonov <eugeni.dodonov@intel.com>
Date:   Mon Jul 2 11:51:08 2012 -0300
drm/i915: disable RC6 when disabling rps


Bug detailed description:
-------------------------
Aborted (core dumped) show during gem_dummy_reloc_loop/ gem_ring_sync_loop test. dmesg log has attached.

[root@x-sgb4 tests]# ./gem_dummy_reloc_loop
running dummy loop on render
dummy loop run on render completed
running dummy loop on bsd
intel_batchbuffer_flush_on_ring:119 failed, ret=-5, errno=5
Aborted (core dumped)

[root@x-sgb4 tests]# ./gem_ring_sync_loop
intel_batchbuffer_flush_on_ring:119 failed, ret=-5, errno=5
Aborted (core dumped)
Comment 1 Du Yan 2012-07-04 18:58:30 UTC
Created attachment 63842 [details]
gem_ring_sync_loop dmesg log
Comment 2 Daniel Vetter 2012-07-05 00:34:25 UTC
We need the error state.
Comment 3 Chris Wilson 2012-07-05 00:41:00 UTC
Also whether ./gem_dummy_reloc_loop reliably triggers the GPU hang (errno=5)?
Comment 4 Chris Wilson 2012-07-05 00:46:25 UTC
That would seem to be true....

bsd command stream:
  HEAD: 0x0000000c
  TAIL: 0x00000050
  ACTHD: 0x0000000c
  IPEIR: 0x00000000
  IPEHR: 0x13040081
  INSTDONE: 0xffabffff
    busy: EU 12
    busy: EU 10
    busy: EU 02
  INSTPS: 0x00000000
  INSTPM: 0x00000000
  FADDR: 0x00023050
  FAULT_REG: 0x00000000
  SYNC_0: 0x00f4775d
  SYNC_1: 0x0104775f
  seqno: 0x00000000
  waiting: yes
  ring->head: 0x00000000
  ring->tail: 0x00000050

batchbuffer (bsd ring) at 0x04446000:
0x04446000:      0x13000001: MI_FLUSH_DW post_sync_op='no write' 
0x04446004:      0x00000000:    address
0x04446008:      0x04445000:    dword
0x0444600c:      0x0040000f: MI_NOOP write NOPID reg, val=0xf
0x04446010:      0x00000000: MI_NOOP
0x04446014:      0x05000000: MI_BATCH_BUFFER_END

bsd ring --- 1 requests
  seqno 0x01047760, emitted 4303505930, tail 0x00000028
ringbuffer (bsd ring) at 0x00023000:
0x00023000:      0x13040081: MI_FLUSH_DW invalidate tlb, post_sync_op='no write'
  invalidate video state (BCS-only),
0x00023004:      0x00000000:    address
0x00023008:      0x00000000:    dword
0x0002300c:      0x00000000: MI_NOOP
0x00023010:      0x18800100: MI_BATCH_BUFFER_START
0x00023014:      0x04446000:    dword 1
0x00023018:      0x13000001: MI_FLUSH_DW post_sync_op='no write' 
0x0002301c:      0x00000000:    address
0x00023020:      0x00000000:    dword
0x00023024:      0x00000000: MI_NOOP
0x00023028:      0x0b640001: MI_SEMAPHORE_MBOX global gtt, update semaphore, use compare reg 0
0x0002302c:      0x01047760:    value
0x00023030:      0x00002040:    address
0x00023034:      0x0b640001: MI_SEMAPHORE_MBOX global gtt, update semaphore, use compare reg 0
0x00023038:      0x01047760:    value
0x0002303c:      0x00022044:    address
0x00023040:      0x10800001: MI_STORE_DATA_INDEX
0x00023044:      0x00000080:    index
0x00023048:      0x01047760:    dword
0x0002304c:      0x01000000: MI_USER_INTERRUPT


So dying on the very first FLUSH_DW of the BSD ring. I guess some of the recent changes (differing rc6 for media ring, any other chickens?) are culpable.
Comment 5 Chris Wilson 2012-07-05 00:55:57 UTC
Also does appear limited to SNB... So recent HW enablement seems most suspect.
Comment 6 Du Yan 2012-07-05 01:20:34 UTC
Created attachment 63849 [details]
errorstate_gem_dummy_reloc_loop
Comment 7 Du Yan 2012-07-05 01:21:34 UTC
Created attachment 63850 [details]
errorstate_gem_ring_sync_loop
Comment 8 Chris Wilson 2012-07-05 05:34:44 UTC
990bbdadabaa51828e475eda86ee5720a4910cc3 is the first bad commit
commit 990bbdadabaa51828e475eda86ee5720a4910cc3
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jul 2 11:51:02 2012 -0300

    drm/i915: Group the GT routines together in both code and vtable

:(
Comment 9 Chris Wilson 2012-07-05 06:26:11 UTC
diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 0e27e95..9dcb401 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -3978,6 +3978,7 @@ static void __gen6_gt_force_wake_get(struct drm_i915_private *dev_priv)
 		DRM_ERROR("Force wake wait timed out\n");
 
 	I915_WRITE_NOTRACE(FORCEWAKE, 1);
+	POSTING_READ(FORCEWAKE);
 
 	if (wait_for_atomic_us((I915_READ_NOTRACE(forcewake_ack) & 1), 500))
 		DRM_ERROR("Force wake wait timed out\n");
Comment 10 Chris Wilson 2012-07-05 06:37:14 UTC
Or s/POSTING_FLUSH/udelay(10)/
Comment 11 Chris Wilson 2012-07-05 06:44:14 UTC
Created attachment 63862 [details]
Trace of the BSD hang
Comment 12 Chris Wilson 2012-07-05 10:06:08 UTC
So the minimum appears to be a ndelay(250). Nor does it appear to be a missing mb(), so it feels like the delay is king. I poked hard at the MI_FLUSH_DW on the the BSD and the tail write w/a, and that does seems to affect it. Nor does disabling rc6 or rps.

And the delay has to be between the FORCEWAKE write and the read of the ACK.

I HAVE NO CLUE. It could be an elephant or an iceberg...
Comment 13 Chris Wilson 2012-07-05 10:34:54 UTC
Created attachment 63867 [details] [review]
Kill the delay in wait_for_atomic()

And since I can't explain it, we may as well replace the udelay() whilst busy-waiting with a cpu_relax(). That way we should complete the atomic sections as fast as possible with roughly the same power efficiency as udelay().
Comment 14 Chris Wilson 2012-07-05 12:44:33 UTC
An unexplained hang that seems dependent upon an unrelated timing makes me extremely nervous.
Comment 15 Du Yan 2012-07-27 05:53:47 UTC
Test with the patch 64704 and patch 64706 from #bug 52424.
Both patches can fix this bug.
Comment 16 Daniel Vetter 2012-08-06 08:59:58 UTC
Ok, I've smashed the two patches onto -fixes again, let's hope it holds. Please confirm that drm-intel-fixes really works now.
Comment 17 Du Yan 2012-08-10 02:21:39 UTC
Issue still can reproduce with the drm-intel-queued commit

Kernel: (drm-intel-next-queued)65bccb5c708bd9f00d24f041f4f7c45130359448
Some additional commit info:
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Aug 8 17:42:52 2012 +0200

    drm/i915: fixup desired rps frequency computation



But test OK with the below drm-intel-fixes commit

Kernel: (drm-intel-fixes)0d8957c8a90bbb5d34fab9a304459448a5131e06
Some additional commit info:
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Aug 7 09:54:14 2012 +0200

    drm/i915: correctly order the ring init sequence
Comment 18 Daniel Vetter 2012-08-10 08:43:11 UTC
Yeah, -next is not always a subset of -fixes, to avoid tons of spurious merged. I'm closing this as confirmed.
Comment 19 Florian Mickler 2012-08-20 21:55:55 UTC
A patch referencing this bug report has been merged in Linux v3.6-rc2:

commit 6af2d180f82151cf3d58952e35a4f96e45bc453a
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Jul 26 16:24:50 2012 +0200

    drm/i915: fix forcewake related hangs on snb
Comment 20 Florian Mickler 2012-08-20 21:57:51 UTC
A patch referencing this bug report has been merged in Linux v3.6-rc2:

commit bcf9dcc1e6269fac674e41f25d007ff75f76e840
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Jul 15 09:42:38 2012 +0100

    drm/i915: Workaround hang with BSD and forcewake on SandyBridge
Comment 21 Florian Mickler 2012-10-15 20:55:12 UTC
A patch referencing a commit referencing this bug report has been merged in Linux v3.7-rc1:

commit 8dee3eea3ccd3b6c00a8d3a08dd715d6adf737dd
Author: Ben Widawsky <ben@bwidawsk.net>
Date:   Sat Sep 1 22:59:50 2012 -0700

    drm/i915: Never read FORCEWAKE
Comment 22 Florian Mickler 2013-01-26 10:49:51 UTC
A patch referencing a commit referencing this bug report has been merged in Linux v3.8-rc5:

commit b514407547890686572606c9dfa4b7f832db9958
Author: Jani Nikula <jani.nikula@intel.com>
Date:   Thu Jan 17 10:24:09 2013 +0200

    drm/i915: fix FORCEWAKE posting reads
Comment 23 Elizabeth 2017-10-06 14:49:13 UTC
Closing old verified.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.