51738 – [SNB regression?] Aborted (core dumped) showed during gem_dummy_reloc_loop/gem_ring_sync_loop test

Bug 51738 - [SNB regression?] Aborted (core dumped) showed during gem_dummy_reloc_loop/gem_ring_sync_loop test

Summary: [SNB regression?] Aborted (core dumped) showed during gem_dummy_reloc_loop/ge...

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel (show other bugs)
Version:	unspecified
Hardware:	Other All

Importance:	high critical
Assignee:	Daniel Vetter
QA Contact:

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2012-07-04 18:57 UTC by Du Yan
Modified:	2017-10-06 14:49 UTC (History)
CC List:	7 users (show)

See Also:
i915 platform:
i915 features:

Attachments
gem_dummy_reloc_loop dmesg log (55.30 KB, text/plain) 2012-07-04 18:57 UTC, Du Yan	no flags	Details
gem_ring_sync_loop dmesg log (51.09 KB, text/plain) 2012-07-04 18:58 UTC, Du Yan	no flags	Details
errorstate_gem_dummy_reloc_loop (2.09 MB, text/plain) 2012-07-05 01:20 UTC, Du Yan	no flags	Details
errorstate_gem_ring_sync_loop (2.36 MB, text/plain) 2012-07-05 01:21 UTC, Du Yan	no flags	Details
Trace of the BSD hang (1.44 MB, application/x-ns-proxy-autoconfig) 2012-07-05 06:44 UTC, Chris Wilson	no flags	Details
Kill the delay in wait_for_atomic() (1.04 KB, patch) 2012-07-05 10:34 UTC, Chris Wilson	no flags	Details \| Splinter Review
View All

Description Du Yan 2012-07-04 18:57:16 UTC

Created attachment 63841 [details]
gem_dummy_reloc_loop dmesg log

System Environment:
--------------------------
Platform: Sandybridge
Kernel: (drm-intel-next-queued)dab92e08895537c7260c811f8e93cbf749f0b727
Some additional commit info:
Author: Eugeni Dodonov <eugeni.dodonov@intel.com>
Date:   Mon Jul 2 11:51:08 2012 -0300
drm/i915: disable RC6 when disabling rps


Bug detailed description:
-------------------------
Aborted (core dumped) show during gem_dummy_reloc_loop/ gem_ring_sync_loop test. dmesg log has attached.

[root@x-sgb4 tests]# ./gem_dummy_reloc_loop
running dummy loop on render
dummy loop run on render completed
running dummy loop on bsd
intel_batchbuffer_flush_on_ring:119 failed, ret=-5, errno=5
Aborted (core dumped)

[root@x-sgb4 tests]# ./gem_ring_sync_loop
intel_batchbuffer_flush_on_ring:119 failed, ret=-5, errno=5
Aborted (core dumped)

Comment 1 Du Yan 2012-07-04 18:58:30 UTC

Created attachment 63842 [details]
gem_ring_sync_loop dmesg log

Comment 2 Daniel Vetter 2012-07-05 00:34:25 UTC

We need the error state.

Comment 3 Chris Wilson 2012-07-05 00:41:00 UTC

Also whether ./gem_dummy_reloc_loop reliably triggers the GPU hang (errno=5)?

Comment 4 Chris Wilson 2012-07-05 00:46:25 UTC

That would seem to be true....

bsd command stream:
  HEAD: 0x0000000c
  TAIL: 0x00000050
  ACTHD: 0x0000000c
  IPEIR: 0x00000000
  IPEHR: 0x13040081
  INSTDONE: 0xffabffff
    busy: EU 12
    busy: EU 10
    busy: EU 02
  INSTPS: 0x00000000
  INSTPM: 0x00000000
  FADDR: 0x00023050
  FAULT_REG: 0x00000000
  SYNC_0: 0x00f4775d
  SYNC_1: 0x0104775f
  seqno: 0x00000000
  waiting: yes
  ring->head: 0x00000000
  ring->tail: 0x00000050

batchbuffer (bsd ring) at 0x04446000:
0x04446000:      0x13000001: MI_FLUSH_DW post_sync_op='no write' 
0x04446004:      0x00000000:    address
0x04446008:      0x04445000:    dword
0x0444600c:      0x0040000f: MI_NOOP write NOPID reg, val=0xf
0x04446010:      0x00000000: MI_NOOP
0x04446014:      0x05000000: MI_BATCH_BUFFER_END

bsd ring --- 1 requests
  seqno 0x01047760, emitted 4303505930, tail 0x00000028
ringbuffer (bsd ring) at 0x00023000:
0x00023000:      0x13040081: MI_FLUSH_DW invalidate tlb, post_sync_op='no write'
  invalidate video state (BCS-only),
0x00023004:      0x00000000:    address
0x00023008:      0x00000000:    dword
0x0002300c:      0x00000000: MI_NOOP
0x00023010:      0x18800100: MI_BATCH_BUFFER_START
0x00023014:      0x04446000:    dword 1
0x00023018:      0x13000001: MI_FLUSH_DW post_sync_op='no write' 
0x0002301c:      0x00000000:    address
0x00023020:      0x00000000:    dword
0x00023024:      0x00000000: MI_NOOP
0x00023028:      0x0b640001: MI_SEMAPHORE_MBOX global gtt, update semaphore, use compare reg 0
0x0002302c:      0x01047760:    value
0x00023030:      0x00002040:    address
0x00023034:      0x0b640001: MI_SEMAPHORE_MBOX global gtt, update semaphore, use compare reg 0
0x00023038:      0x01047760:    value
0x0002303c:      0x00022044:    address
0x00023040:      0x10800001: MI_STORE_DATA_INDEX
0x00023044:      0x00000080:    index
0x00023048:      0x01047760:    dword
0x0002304c:      0x01000000: MI_USER_INTERRUPT


So dying on the very first FLUSH_DW of the BSD ring. I guess some of the recent changes (differing rc6 for media ring, any other chickens?) are culpable.

Comment 5 Chris Wilson 2012-07-05 00:55:57 UTC

Also does appear limited to SNB... So recent HW enablement seems most suspect.

Comment 6 Du Yan 2012-07-05 01:20:34 UTC

Created attachment 63849 [details]
errorstate_gem_dummy_reloc_loop

Comment 7 Du Yan 2012-07-05 01:21:34 UTC

Created attachment 63850 [details]
errorstate_gem_ring_sync_loop

Comment 8 Chris Wilson 2012-07-05 05:34:44 UTC

990bbdadabaa51828e475eda86ee5720a4910cc3 is the first bad commit
commit 990bbdadabaa51828e475eda86ee5720a4910cc3
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jul 2 11:51:02 2012 -0300

    drm/i915: Group the GT routines together in both code and vtable

:(

Comment 9 Chris Wilson 2012-07-05 06:26:11 UTC

diff --git a/drivers/gpu/drm/i915/intel_pm.c b/drivers/gpu/drm/i915/intel_pm.c
index 0e27e95..9dcb401 100644
--- a/drivers/gpu/drm/i915/intel_pm.c
+++ b/drivers/gpu/drm/i915/intel_pm.c
@@ -3978,6 +3978,7 @@ static void __gen6_gt_force_wake_get(struct drm_i915_private *dev_priv)
 		DRM_ERROR("Force wake wait timed out\n");
 
 	I915_WRITE_NOTRACE(FORCEWAKE, 1);
+	POSTING_READ(FORCEWAKE);
 
 	if (wait_for_atomic_us((I915_READ_NOTRACE(forcewake_ack) & 1), 500))
 		DRM_ERROR("Force wake wait timed out\n");

Comment 10 Chris Wilson 2012-07-05 06:37:14 UTC

Or s/POSTING_FLUSH/udelay(10)/

Comment 11 Chris Wilson 2012-07-05 06:44:14 UTC

Created attachment 63862 [details]
Trace of the BSD hang

Comment 12 Chris Wilson 2012-07-05 10:06:08 UTC

So the minimum appears to be a ndelay(250). Nor does it appear to be a missing mb(), so it feels like the delay is king. I poked hard at the MI_FLUSH_DW on the the BSD and the tail write w/a, and that does seems to affect it. Nor does disabling rc6 or rps.

And the delay has to be between the FORCEWAKE write and the read of the ACK.

I HAVE NO CLUE. It could be an elephant or an iceberg...

Comment 13 Chris Wilson 2012-07-05 10:34:54 UTC

Created attachment 63867 [details] [review]
Kill the delay in wait_for_atomic()

And since I can't explain it, we may as well replace the udelay() whilst busy-waiting with a cpu_relax(). That way we should complete the atomic sections as fast as possible with roughly the same power efficiency as udelay().

Comment 14 Chris Wilson 2012-07-05 12:44:33 UTC

An unexplained hang that seems dependent upon an unrelated timing makes me extremely nervous.

Comment 15 Du Yan 2012-07-27 05:53:47 UTC

Test with the patch 64704 and patch 64706 from #bug 52424.
Both patches can fix this bug.

Comment 16 Daniel Vetter 2012-08-06 08:59:58 UTC

Ok, I've smashed the two patches onto -fixes again, let's hope it holds. Please confirm that drm-intel-fixes really works now.

Comment 17 Du Yan 2012-08-10 02:21:39 UTC

Issue still can reproduce with the drm-intel-queued commit

Kernel: (drm-intel-next-queued)65bccb5c708bd9f00d24f041f4f7c45130359448
Some additional commit info:
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Aug 8 17:42:52 2012 +0200

    drm/i915: fixup desired rps frequency computation



But test OK with the below drm-intel-fixes commit

Kernel: (drm-intel-fixes)0d8957c8a90bbb5d34fab9a304459448a5131e06
Some additional commit info:
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Tue Aug 7 09:54:14 2012 +0200

    drm/i915: correctly order the ring init sequence

Comment 18 Daniel Vetter 2012-08-10 08:43:11 UTC

Yeah, -next is not always a subset of -fixes, to avoid tons of spurious merged. I'm closing this as confirmed.

Comment 19 Florian Mickler 2012-08-20 21:55:55 UTC

A patch referencing this bug report has been merged in Linux v3.6-rc2:

commit 6af2d180f82151cf3d58952e35a4f96e45bc453a
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Thu Jul 26 16:24:50 2012 +0200

    drm/i915: fix forcewake related hangs on snb

Comment 20 Florian Mickler 2012-08-20 21:57:51 UTC

A patch referencing this bug report has been merged in Linux v3.6-rc2:

commit bcf9dcc1e6269fac674e41f25d007ff75f76e840
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Jul 15 09:42:38 2012 +0100

    drm/i915: Workaround hang with BSD and forcewake on SandyBridge

Comment 21 Florian Mickler 2012-10-15 20:55:12 UTC

A patch referencing a commit referencing this bug report has been merged in Linux v3.7-rc1:

commit 8dee3eea3ccd3b6c00a8d3a08dd715d6adf737dd
Author: Ben Widawsky <ben@bwidawsk.net>
Date:   Sat Sep 1 22:59:50 2012 -0700

    drm/i915: Never read FORCEWAKE

Comment 22 Florian Mickler 2013-01-26 10:49:51 UTC

A patch referencing a commit referencing this bug report has been merged in Linux v3.8-rc5:

commit b514407547890686572606c9dfa4b7f832db9958
Author: Jani Nikula <jani.nikula@intel.com>
Date:   Thu Jan 17 10:24:09 2013 +0200

    drm/i915: fix FORCEWAKE posting reads

Comment 23 Elizabeth 2017-10-06 14:49:13 UTC

Closing old verified.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.