62191 – Buffer Object corruption with multiple processes

Bug 62191 - Buffer Object corruption with multiple processes

Summary: Buffer Object corruption with multiple processes

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel (show other bugs)
Version:	unspecified
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	medium blocker
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Depends on:
Blocks:

Reported:	2013-03-11 20:03 UTC by jon.bloomfield
Modified:	2017-07-24 22:58 UTC (History)
CC List:	0 users

See Also:
i915 platform:
i915 features:

Attachments
drm buffer object bug demo program (4.57 KB, text/plain) 2013-03-11 20:03 UTC, jon.bloomfield	no flags	Details
Shoot those elephants! (3.08 KB, patch) 2013-03-21 15:21 UTC, Chris Wilson	no flags	Details \| Splinter Review
rotf (3.06 KB, patch) 2013-03-22 14:36 UTC, Chris Wilson	no flags	Details \| Splinter Review
W/a for VLV (3.62 KB, patch) 2013-05-21 12:24 UTC, Chris Wilson	no flags	Details \| Splinter Review
Show Obsolete (1) View All

Description jon.bloomfield 2013-03-11 20:03:21 UTC

Created attachment 76351 [details]
drm buffer object bug demo program

When several processes attempt to use gem buffer objects simultaneously data written to the buffers is occasionally lost.

The attached C program, selftest.c, demonstrates the issue. The program spawns 1 or more child processes each of which runs exactly the same test sequence:

  create tiled buffer object
  loop 10 times
    map buffer to gtt
    fill buffer with unique data, and verify each write immediately (A)
    unmap buffer

    map buffer to gtt
    readback and verify data in buffer (B)
    unmap buffer
  end loop

When run with more than about 20 test processes the readback of written data occasionally fails. The chance of errors being introduced increases with the number of processes.

Note that the program verifies the data in two places. Once as each DWord of data is written to the buffer (A), and again in a separate pass after the entire buffer has been filled. If an error is detected at (A) the child exits, so any errors reported at (B) imply that the data did originally write correctly to the buffer, but has subsequently been lost.

Each processes data is unique, and is formed by the DWord offset in the lower 5 nibbles, followed by the loop counter in the 6th nibble, and finally the process instance in the upper byte.

In all failures I have seen to date no cross-process corruption has occurred. The erroneous data always corresponds to data previously written by the SAME process, usually from a previous iteration of the loop.

Latest kernel tested: 3.8.0-RC6+ from git://people.freedesktop.org/~danvet/drm-intel (2013-03-07)

Comment 1 Chris Wilson 2013-03-11 20:26:04 UTC

You have multiple processes writing to the same memory addresses with no serialisation. The results are undefined.

Comment 2 Chris Wilson 2013-03-11 20:30:49 UTC

Meh, I take that back, I thought surface_bo was shared, but it is indeed created after the fork().

Comment 3 Chris Wilson 2013-03-11 21:04:57 UTC

To summarise: multiple cores vs fence thrashing vs llc.

Comment 4 Chris Wilson 2013-03-11 22:34:16 UTC

And not just any old fence, as it only occurs with y-major fences.

Comment 5 jon.bloomfield 2013-03-12 09:07:54 UTC

Running X Major with 100 processes still shows corruption. It does seem slightly better than Y Major though which fails at just 30 processes on my machine at least.

Running with no tiling, does indeed seem fine

Comment 6 Chris Wilson 2013-03-12 11:37:27 UTC

Ok, X is also bad, must have left it at NONE whilst testing. As expected nosmp works...

Comment 7 Chris Wilson 2013-03-12 18:46:27 UTC

Also disabling wc mapping for the aperture work (i.e. using uncached accesses for GTT maps).

Comment 8 Chris Wilson 2013-03-21 11:25:48 UTC

Hack du jour to prevent fence migration:

diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 19fc21b..52240f8 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2759,7 +2759,10 @@ i915_find_fence_reg(struct drm_device *dev)
 	}
 
 	if (avail == NULL)
-		return NULL;
+		return ERR_PTR(-EDEADLK);
+
+	set_need_resched();
+	return ERR_PTR(-EAGAIN);
 
 	/* None available, try to steal one or wait for a user to finish */
 	list_for_each_entry(reg, &dev_priv->mm.fence_list, lru_list) {
@@ -2814,8 +2817,8 @@ i915_gem_object_get_fence(struct drm_i915_gem_object *obj)
 		}
 	} else if (enable) {
 		reg = i915_find_fence_reg(dev);
-		if (reg == NULL)
-			return -EDEADLK;
+		if (IS_ERR(reg))
+			return PTR_ERR(reg);
 
 		if (reg->obj) {
 			struct drm_i915_gem_object *old = reg->obj;

Comment 9 Chris Wilson 2013-03-21 15:21:45 UTC

Created attachment 76867 [details] [review]
Shoot those elephants!

Comment 10 Chris Wilson 2013-03-22 13:22:02 UTC

For the record, the per-cpu mmio only works on SNB. A new trick is required for the IVB herd.

Comment 11 Chris Wilson 2013-03-22 14:36:01 UTC

Created attachment 76915 [details] [review]
rotf

Comment 12 jon.bloomfield 2013-03-22 15:36:00 UTC

Tested-by: "Bloomfield, Jon" <jon.bloomfield@intel.com<mailto:jon.bloomfield@intel.com>>
---------------------------------------------------------------------
Intel Corporation (UK) Limited
Registered No. 1134945 (England)
Registered Office: Pipers Way, Swindon SN3 1RJ
VAT No: 860 2173 47

This e-mail and any attachments may contain confidential material for
the sole use of the intended recipient(s). Any review or distribution
by others is strictly prohibited. If you are not the intended
recipient, please contact the sender and delete all copies.

Comment 13 Chris Wilson 2013-03-22 15:48:29 UTC

The good news (from my perspective) is that 2D performance appears to be unaffected by calling stop-machine per fence update.

Comment 14 Chris Wilson 2013-04-04 16:40:37 UTC

Daniel, can we please apply the patch? I know you enjoy testing QA...

Comment 15 Daniel Vetter 2013-04-04 18:11:18 UTC

(In reply to comment #14)
> Daniel, can we please apply the patch? I know you enjoy testing QA...

Was Mika's version of it in i-g-t not good enough?

commit ee79b8fccd64218fb8218d1c281ff34db1724e87
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date:   Wed Mar 27 12:48:07 2013 +0200

    tests: add write-verify test to gem_fence_thrash

But yeah, QA didn't yet report this one here ...

Comment 16 Chris Wilson 2013-04-04 18:20:13 UTC

I'm looking forward to seeing the fix land upstream...

Comment 17 Daniel Vetter 2013-04-07 19:25:02 UTC

The test (as-is in igt) is imo still a bit too unreliable to trigger it, at least for my set of machines here. Evil ideas to improve it highly welcome. I've posted a patch which at least somewhat improves things.

Comment 18 Daniel Vetter 2013-04-08 10:35:52 UTC

Fix merged to dinq with cc: stable:

commit f002ca62b04e6050fa60f822e7869df0b31c8b33
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Apr 4 21:31:03 2013 +0100

    drm/i915: Workaround incoherence between fences and LLC across multiple CPUs

Comment 19 Daniel Vetter 2013-05-03 16:35:26 UTC

vlv is supposedly also affected ...

Comment 20 jon.bloomfield 2013-05-10 12:17:31 UTC

It looks like the fix, as written, is not working on VLV, even after removing the test for LLC and making the invalidate unconditional. Symptoms are identical to the original bug on IVB.

Comment 21 Chris Wilson 2013-05-10 14:27:33 UTC

Did you try with stop_machine()? *g*

Comment 22 jon.bloomfield 2013-05-10 14:30:51 UTC

We'll try that. Thanks

Comment 23 jon.bloomfield 2013-05-14 13:12:44 UTC

We tried two stop_machine() experiments on VLV - The first just called wbinvd() from stop machine, instead of on_each_cpu(). This made no difference.

However, reverting to the v2 patch, and calling i915_gem_write_fence(), from stop_machine, appears to have resolved the problem. We ran 1000 iterations of the selftest bug-demo with no failures. We also had to remove the HAS_LLC test, since VLV has no LLC.

Note that we surrounded the call to i915_gem_write_fence() with wbinvd(). We didn't have time to see if it was possible to remove these invalidate operations.

The relevant code-portions are shown below...

static int i915_gem_write_fence__ipi(void *data)
{
	struct write_fence *args = data;
	wbinvd();
	i915_gem_write_fence(args->dev, args->fence, args->obj);
	wbinvd();
	return 0;
}

static void i915_gem_object_update_fence(struct drm_i915_gem_object *obj,
					 struct drm_i915_fence_reg *fence,
					 bool enable)
{
	struct drm_i915_private *dev_priv = obj->base.dev->dev_private;
	struct write_fence args = {
		.dev = obj->base.dev,
		.fence = fence_number(dev_priv, fence),
		.obj = enable ? obj : NULL,
	};

	stop_machine(i915_gem_write_fence__ipi, &args, cpu_possible_mask);

	if (enable) {
		obj->fence_reg = args.fence;
		fence->obj = obj;
		list_move_tail(&fence->lru_list, &dev_priv->mm.fence_list);
	} else {
		obj->fence_reg = I915_FENCE_REG_NONE;
		fence->obj = NULL;
		list_del_init(&fence->lru_list);
	}
}

Comment 24 Chris Wilson 2013-05-21 12:24:45 UTC

Created attachment 79614 [details] [review]
W/a for VLV

Jon, can you please test this patch on all the machines you have available?

Comment 25 jon.bloomfield 2013-05-22 15:59:42 UTC

I can confirm that the new patch works on IVB, HSW, and VLV2.

Comment 26 Daniel Vetter 2013-05-23 07:52:51 UTC

Improved patch merged to dinq:

commit 69dc79f8930900bf021dcc3cfcba3141cb373145
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed May 22 17:08:06 2013 +0100

    drm/i915: Workaround incoherence with fence updates on Valleyview

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.