Bug 45492

Summary:

[snb] render ring overflow

Product:

DRI

Reporter:

Chris Wilson <chris>

Component:

DRM/Intel

Assignee:

Daniel Vetter <daniel>

Status:

CLOSED FIXED

QA Contact:

Severity:

normal

Priority:

medium

CC:

ben, chris, daniel, eugeni, jbarnes, yi.sun

Version:

unspecified

Hardware:

Other

OS:

All

Whiteboard:

i915 platform:

i915 features:

Bug Depends on:

Bug Blocks:

42991, 44622

Attachments:

Description	Flags
An example of the error state	none
Sample error state	none
always bind to ppgtt (patch against current my-next)	none
No more HEAD please, I'm British!	none
Return tail from ring->add_request()	none
Use request->head to guess current ring->head	none
Keep using request->head until we run out of requests	none
Use request->tail to guess current ring->head	none
Revert use of autoreported head	none
Record tail at each request and use to estimate ring->head.	none
A sample error state showing requests	none

Description Chris Wilson 2012-02-01 04:19:02 UTC

Created attachment 56439 [details]
An example of the error state

Having updated sna to prefer to use the BLT for doing fills (it is far faster than using the current shaders on RENDER) if we have semaphores, I am now encountering a reproducible hang whilst running cairo-traces (though not always in the same spot).

sna: commit df4e1059a4e09998334dde6aa1c8ccfe76e442c0
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Jan 31 19:19:13 2012 +0000

    sna/gen6: Prefer to do fills using the BLT

Comment 1 Chris Wilson 2012-02-01 04:19:52 UTC

Various magic patches tried so far: finish-gpu, o_l_r seqno fix, rc6=0, fbc=0, forcewaked

Comment 2 Chris Wilson 2012-02-01 04:25:12 UTC

Also in case it matters, this is an mba4.1 (11") with an i5-2467m. Now where's the stepping info?... ;-)

Comment 3 Chris Wilson 2012-02-01 05:06:08 UTC

Created attachment 56445 [details]
Sample error state

Updated error state to include inter-ring sync registers, as supposedly used by the semaphore mbox.

Comment 4 Daniel Vetter 2012-02-01 07:46:03 UTC

Created attachment 56453 [details] [review]
always bind to ppgtt (patch against current my-next)

There are too many stern warnings that not having 1:1 global:per-process gtt bindings can lead to hilarity, and one of the saner theories to explain the error_state is to assume that the gpu read a bunch of MI_NOOP from the render ring (which it would if it would read through ppgtt).

So let's try to rule that out first.

Comment 5 Chris Wilson 2012-02-01 08:55:58 UTC

Good news! No effect with forcing the 1:1 mapping of ggtt:ppgtt.

Comment 6 Chris Wilson 2012-02-01 09:05:08 UTC

The light at the end of the tunnel seems to be not to use LLC-caching on the ring buffers... But first lets see if it survives just one complete cycle!

Comment 7 Chris Wilson 2012-02-01 09:15:25 UTC

No, it didn't make it to the end of the cairo-traces.

Comment 8 Chris Wilson 2012-02-01 10:21:55 UTC

The reported head postion through the HWS is just a random value:

[   55.365394] ring 0: bogus return from HWS: 9023488 > 2099072
[   55.530138] ring 0: bogus return from HWS: 113971200 > 2142592
[   55.770873] ring 0: bogus return from HWS: 94113792 > 2208832
[   56.012128] ring 0: bogus return from HWS: 146026496 > 4236360
[   56.037306] ring 0: bogus return from HWS: 269692928 > 4241160
[   56.042295] ring 0: bogus return from HWS: 9060352 > 4242120
[   56.043003] ring 0: bogus return from HWS: 9060352 > 4242428

Comment 9 Chris Wilson 2012-02-01 10:46:23 UTC

Created attachment 56463 [details] [review]
No more HEAD please, I'm British!

My theory as to why this requires semaphores and became easier to hit with the pipe-control w/a, is that we need to exhaust the ring buffer often enough to read a bogus and dangerous value. Using semaphores (or UXA) generates lots of BLT/RENDER switches and lots of tiny batches, and the pipe-control magnified the amount of space required for every batch.

Comment 10 Daniel Vetter 2012-02-01 12:18:12 UTC

commit 6aa56062eaba67adfb247cded244fd877329588d
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Oct 29 21:44:37 2010 +0100

    drm/i915/ringbuffer: Use the HEAD auto-reporting mechanism

... the offending commit that introduced the issue.

Comment 11 Chris Wilson 2012-02-01 12:23:47 UTC

For the record, the debugging patch was equivalent to:

diff --git a/drivers/gpu/drm/i915/intel_ringbuffer.c b/drivers/gpu/drm/i915/intel_ringbuffer.c
index 023739f..54420ed 100644
--- a/drivers/gpu/drm/i915/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/intel_ringbuffer.c
@@ -1065,12 +1065,18 @@ int intel_wait_ring_buffer(struct intel_ring_buffer *ring, int n)
 	 * fallback to the slow and accurate path.
 	 */
 	head = intel_read_status_page(ring, 4);
+#if 0
 	if (head > ring->head) {
 		ring->head = head;
 		ring->space = ring_space(ring);
 		if (ring->space >= n)
 			return 0;
 	}
+#else
+	if (head > I915_READ_HEAD(ring))
+		printk("ring %d: read of reported HEAD %d is in advance of actual HEAD %d\n",
+		       ring->id, head, I915_READ_HEAD(ring));
+#endif
 
 	trace_i915_ring_wait_begin(ring);
 	if (drm_core_check_feature(dev, DRIVER_GEM))

Comment 12 Chris Wilson 2012-02-01 14:03:56 UTC

It does come with a consequence for *GPU* traces:

Slowdowns
=========
 xlib           firefox-fishtank  2324.51 -> 12551.04:  5.40x slowdown
 xlib          firefox-paintball   882.43 -> 19323.76: 21.90x slowdown

Comment 13 Chris Wilson 2012-02-01 14:04:18 UTC

It does come with a consequence for *GPU-bound* traces:

Slowdowns
=========
 xlib           firefox-fishtank  2324.51 -> 12551.04:  5.40x slowdown
 xlib          firefox-paintball   882.43 -> 19323.76: 21.90x slowdown

Comment 14 Chris Wilson 2012-02-01 14:41:20 UTC

Created attachment 56479 [details] [review]
Return tail from ring->add_request()

Comment 15 Chris Wilson 2012-02-01 14:42:02 UTC

Created attachment 56480 [details] [review]
Use request->head to guess current ring->head

Comment 16 Chris Wilson 2012-02-01 15:40:34 UTC

Created attachment 56481 [details] [review]
Keep using request->head until we run out of requests

Ok, using the request to estimate HEAD prevents the performance regression, and things appear to be stable again on SNB. Testing on PNV now.

Comment 17 Chris Wilson 2012-02-02 04:12:17 UTC

Argh. Looks like I fluffed the testing and that the perf-cases I have exploit this bug... Bizarre, very bizarre...

Comment 18 Chris Wilson 2012-02-02 07:49:27 UTC

Created attachment 56522 [details] [review]
Use request->tail to guess current ring->head

Current patch that seems non-buggy, but no perf improvement (despite the old changelog!)

Comment 19 Eugeni Dodonov 2012-02-02 08:27:13 UTC

*** Bug 45273 has been marked as a duplicate of this bug. ***

Comment 20 Chris Wilson 2012-02-08 03:58:20 UTC

A sample of autoreported heads from snb:
[  141.970680] head now 00e10000, actual 00008238, bogus? 1
[  141.974062] head now 02734000, actual 000083c8, bogus? 1
[  141.974425] head now 00e10000, actual 00008488, bogus? 1
[  141.980374] head now 032b5000, actual 000088b8, bogus? 1
[  141.980885] head now 03271000, actual 00008950, bogus? 1
[  142.040628] head now 02101000, actual 00008b40, bogus? 1
[  142.180173] head now 02734000, actual 00009050, bogus? 1
[  142.181090] head now 00000000, actual 00000ae0, bogus? 0
[  142.183737] head now 02734000, actual 00009050, bogus? 1
[  142.183747] head now 00000000, actual 00000b30, bogus? 0
[  142.183755] head now 02734000, actual 00009050, bogus? 1
[  142.190904] head now 03271000, actual 00009110, bogus? 1
[  142.216101] head now 02101000, actual 00009290, bogus? 1
[  142.242984] head now 00e54000, actual 00009420, bogus? 1
[  142.280395] head now 00e10000, actual 000095a0, bogus? 1
[  142.299506] head now 02185000, actual 00009660, bogus? 1
[  142.314656] head now 05642000, actual 00009720, bogus? 1
[  142.353926] head now 02734000, actual 000098a0, bogus? 1

Comment 21 Chris Wilson 2012-02-08 04:16:09 UTC

From q35:
[  217.975608] head now 00010000, actual 00010000, bogus? 0
[  436.725613] head now 00200000, actual 00200000, bogus? 0
[  462.956033] head now 00210000, actual 00210010, bogus? 0
[  485.501409] head now 00400000, actual 00400020, bogus? 0
[  508.064280] head now 00410000, actual 00410000, bogus? 0
[  530.576078] head now 00600000, actual 00600020, bogus? 0

The misses there are equally worrying, but at least don't cause us to blow up. Also no evidence that this the long sort after all-generation bug.

Comment 22 Chris Wilson 2012-02-08 04:18:31 UTC

Ok, the failure on q35 is starting to look extremely systematic:
[  217.975608] head now 00010000, actual 00010000, bogus? 0
[  436.725613] head now 00200000, actual 00200000, bogus? 0
[  462.956033] head now 00210000, actual 00210010, bogus? 0
[  485.501409] head now 00400000, actual 00400020, bogus? 0
[  508.064280] head now 00410000, actual 00410000, bogus? 0
[  530.576078] head now 00600000, actual 00600020, bogus? 0
[  553.273489] head now 00610000, actual 00610018, bogus? 0
[  575.732116] head now 00800000, actual 00800028, bogus? 0
[  598.396240] head now 00810000, actual 00810058, bogus? 0
[  620.868948] head now 00a00000, actual 00a00008, bogus? 0
[  726.153771] head now 00a10000, actual 00a10008, bogus? 0

That is either the wrap counter is being incremented twice, or we miss notifications for the odd wrap values.

Comment 23 Chris Wilson 2012-02-08 06:25:31 UTC

Created attachment 56749 [details] [review]
Revert use of autoreported head

Comment 24 Chris Wilson 2012-02-08 06:26:32 UTC

Created attachment 56750 [details] [review]
Record tail at each request and use to estimate ring->head.

Comment 25 Chris Wilson 2012-02-08 13:08:31 UTC

Created attachment 56765 [details]
A sample error state showing requests

Comment 26 Chris Wilson 2012-02-11 04:20:28 UTC

Boo, hiss! With proper waiting, we go from 12Mglyphs/s to 5.2Mglyphs/s on i5-2520m. :(

I was so happy to have broken the 10Mglyphs/s barrier and thrashed nvidia into submission. Oh hardware, how I hate thee.

Comment 27 Chris Wilson 2012-02-24 06:37:08 UTC

*** Bug 46575 has been marked as a duplicate of this bug. ***

Comment 28 Chris Wilson 2012-02-24 06:37:58 UTC

Note that i-g-t/tests/gem_ringfill was enhanced to exercise this bug.

Comment 29 Chris Wilson 2012-02-28 11:17:20 UTC

On the way to linus, currently sitting in airlied/drm-fixes:

commit 5d031e5b633d910f35e6e0abce94d9d842390006
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Feb 8 13:34:13 2012 +0000

    drm/i915: Remove use of the autoreported ringbuffer HEAD position
    
    This is a revert of 6aa56062eaba67adfb247cded244fd877329588d.
    
    This was originally introduced to workaround reads of the ringbuffer
    registers returning 0 on SandyBridge causing hangs due to ringbuffer
    overflow. The root cause here was reads through the GT powerwell require
    the forcewake dance, something we only learnt of later. Now it appears
    that reading the reported head position from the HWS is returning
    garbage, leading once again to hangs.
    
    For example, on q35 the autoreported head reports:
      [  217.975608] head now 00010000, actual 00010000
      [  436.725613] head now 00200000, actual 00200000
      [  462.956033] head now 00210000, actual 00210010
      [  485.501409] head now 00400000, actual 00400020
      [  508.064280] head now 00410000, actual 00410000
      [  530.576078] head now 00600000, actual 00600020
      [  553.273489] head now 00610000, actual 00610018
    which appears reasonably sane. In contrast, if we look at snb:
      [  141.970680] head now 00e10000, actual 00008238
      [  141.974062] head now 02734000, actual 000083c8
      [  141.974425] head now 00e10000, actual 00008488
      [  141.980374] head now 032b5000, actual 000088b8
      [  141.980885] head now 03271000, actual 00008950
      [  142.040628] head now 02101000, actual 00008b40
      [  142.180173] head now 02734000, actual 00009050
      [  142.181090] head now 00000000, actual 00000ae0
      [  142.183737] head now 02734000, actual 00009050
    
    In addition, the automatic reporting of the head position is scheduled
    to be defeatured in the future. It has no more utility, remove it.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=45492
    Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
    Tested-by: Eric Anholt <eric@anholt.net>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Jesse Barnes <jbarnes@virtuousgeek.org>

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.