| Summary: | Buffer Object corruption with multiple processes | | |
|---|---|---|---|
| Product: | DRI | Reporter: | jon.bloomfield |
| Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
| Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
| Severity: | blocker | Priority: | medium |
| Version: | unspecified | Hardware: | x86-64 (AMD64) |
| OS: | Linux (All) | Whiteboard: | |
| i915 platform: | | i915 features: | |
| Attachments: | selftest.c bug demo program (76351); "Shoot those elephants!" (76867); "rotf" (76915); W/a for VLV (79614) | | |

You have multiple processes writing to the same memory addresses with no serialisation. The results are undefined.

Meh, I take that back: I thought surface_bo was shared, but it is indeed created after the fork().

To summarise: multiple cores vs fence thrashing vs LLC. And not just any old fence, as it only occurs with Y-major fences.

Running X-major with 100 processes still shows corruption. It does seem slightly better than Y-major, though, which fails at just 30 processes, on my machine at least. Running with no tiling does indeed seem fine.

Ok, X is also bad; I must have left it at NONE whilst testing.

As expected, nosmp works... Disabling WC mapping for the aperture also works (i.e. using uncached accesses for GTT maps). Hack du jour to prevent fence migration:

```diff
diff --git a/drivers/gpu/drm/i915/i915_gem.c b/drivers/gpu/drm/i915/i915_gem.c
index 19fc21b..52240f8 100644
--- a/drivers/gpu/drm/i915/i915_gem.c
+++ b/drivers/gpu/drm/i915/i915_gem.c
@@ -2759,7 +2759,10 @@ i915_find_fence_reg(struct drm_device *dev)
 	}
 
 	if (avail == NULL)
-		return NULL;
+		return ERR_PTR(-EDEADLK);
+
+	set_need_resched();
+	return ERR_PTR(-EAGAIN);
 
 	/* None available, try to steal one or wait for a user to finish */
 	list_for_each_entry(reg, &dev_priv->mm.fence_list, lru_list) {
@@ -2814,8 +2817,8 @@ i915_gem_object_get_fence(struct drm_i915_gem_object *obj)
 		}
 	} else if (enable) {
 		reg = i915_find_fence_reg(dev);
-		if (reg == NULL)
-			return -EDEADLK;
+		if (IS_ERR(reg))
+			return PTR_ERR(reg);
 
 		if (reg->obj) {
 			struct drm_i915_gem_object *old = reg->obj;
```

Created attachment 76867 [details] [review]
Shoot those elephants!

For the record, the per-CPU MMIO only works on SNB. A new trick is required for the IVB herd.

Created attachment 76915 [details] [review]
rotf

Tested-by: "Bloomfield, Jon" <jon.bloomfield@intel.com>

The good news (from my perspective) is that 2D performance appears to be unaffected by calling stop_machine() per fence update.

Daniel, can we please apply the patch? I know you enjoy testing QA...

(In reply to comment #14)
> Daniel, can we please apply the patch? I know you enjoy testing QA...

Was Mika's version of it in i-g-t not good enough?

```
commit ee79b8fccd64218fb8218d1c281ff34db1724e87
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date:   Wed Mar 27 12:48:07 2013 +0200

    tests: add write-verify test to gem_fence_thrash
```

But yeah, QA didn't yet report this one here ...

I'm looking forward to seeing the fix land upstream...

The test (as-is in igt) is imo still a bit too unreliable to trigger it, at least for my set of machines here. Evil ideas to improve it are highly welcome. I've posted a patch which at least somewhat improves things.

Fix merged to dinq with cc: stable:

```
commit f002ca62b04e6050fa60f822e7869df0b31c8b33
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Apr 4 21:31:03 2013 +0100

    drm/i915: Workaround incoherence between fences and LLC across multiple CPUs
```

vlv is supposedly also affected ...
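For reference, here is a minimal sketch of the shape of that LLC workaround, reconstructed from the follow-up comments in this thread ("wbinvd() ... on_each_cpu()", "the test for LLC") rather than from the commit itself; the exact gating and placement in f002ca62 may differ. The idea is that before a fence register is rewritten, every CPU is forced to write back and invalidate its caches, so no core keeps a stale view of the fenced GTT range:

```c
/* Hedged sketch, not the literal commit: reconstructed from the
 * discussion in this bug. Exact guards in f002ca62 may differ. */
static void i915_gem_write_fence__ipi(void *data)
{
	wbinvd();	/* write back and invalidate this CPU's caches */
}

static void i915_gem_object_update_fence(struct drm_i915_gem_object *obj,
					 struct drm_i915_fence_reg *fence,
					 bool enable)
{
	struct drm_device *dev = obj->base.dev;
	struct drm_i915_private *dev_priv = dev->dev_private;
	int reg = fence_number(dev_priv, fence);

	/* On LLC machines the CPU caches can shadow the fenced GTT
	 * range, so flush every CPU before the fence is rewritten. */
	if (HAS_LLC(dev))
		on_each_cpu(i915_gem_write_fence__ipi, NULL, 1);

	i915_gem_write_fence(dev, reg, enable ? obj : NULL);

	/* ... fence->obj and LRU bookkeeping as in the existing code ... */
}
```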
It looks like the fix, as written, is not working on VLV, even after removing the test for LLC and making the invalidate unconditional. Symptoms are identical to the original bug on IVB.

Did you try with stop_machine()? *g*

We'll try that. Thanks.

We tried two stop_machine() experiments on VLV. The first just called wbinvd() from stop_machine() instead of on_each_cpu(); this made no difference. However, reverting to the v2 patch and calling i915_gem_write_fence() from stop_machine() appears to have resolved the problem. We ran 1000 iterations of the selftest bug demo with no failures. We also had to remove the HAS_LLC test, since VLV has no LLC.

Note that we surrounded the call to i915_gem_write_fence() with wbinvd(). We didn't have time to see whether it was possible to remove these invalidate operations. The relevant code portions are shown below:

```c
static int i915_gem_write_fence__ipi(void *data)
{
	struct write_fence *args = data;

	wbinvd();
	i915_gem_write_fence(args->dev, args->fence, args->obj);
	wbinvd();

	return 0;
}

static void i915_gem_object_update_fence(struct drm_i915_gem_object *obj,
					 struct drm_i915_fence_reg *fence,
					 bool enable)
{
	struct drm_i915_private *dev_priv = obj->base.dev->dev_private;
	struct write_fence args = {
		.dev = obj->base.dev,
		.fence = fence_number(dev_priv, fence),
		.obj = enable ? obj : NULL,
	};

	stop_machine(i915_gem_write_fence__ipi, &args, cpu_possible_mask);

	if (enable) {
		obj->fence_reg = args.fence;
		fence->obj = obj;
		list_move_tail(&fence->lru_list, &dev_priv->mm.fence_list);
	} else {
		obj->fence_reg = I915_FENCE_REG_NONE;
		fence->obj = NULL;
		list_del_init(&fence->lru_list);
	}
}
```

Created attachment 79614 [details] [review]
W/a for VLV

Jon, can you please test this patch on all the machines you have available?

I can confirm that the new patch works on IVB, HSW, and VLV2.

Improved patch merged to dinq:

```
commit 69dc79f8930900bf021dcc3cfcba3141cb373145
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed May 22 17:08:06 2013 +0100

    drm/i915: Workaround incoherence with fence updates on Valleyview
```
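A note on the design choice above, since it may not be obvious why stop_machine() succeeded where on_each_cpu() did not: on_each_cpu() merely runs a callback on every CPU via IPI while the rest of the system keeps executing around it, whereas stop_machine() first parks every CPU in a tight loop with interrupts disabled and only then runs the callback, so no other core can touch the fenced range between the two wbinvd() calls and the register write. A minimal, self-contained usage sketch follows; the names here are hypothetical illustrations, not code from the patch:

```c
#include <linux/stop_machine.h>

/* Hypothetical example: state that must never be observed
 * half-written by any other CPU. Not from the i915 patch. */
static int critical_state;

static int quiesced_update(void *data)
{
	/* While this runs, every other CPU spins with interrupts
	 * off, so no concurrent reader or writer exists anywhere. */
	critical_state = *(int *)data;
	return 0;
}

static int set_critical_state(int v)
{
	/* A NULL cpumask runs the callback on a single CPU while the
	 * rest are held; the VLV patch instead passes
	 * cpu_possible_mask so the callback (and its wbinvd()) runs
	 * on every CPU. */
	return stop_machine(quiesced_update, &v, NULL);
}
```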
Created attachment 76351 [details]
drm buffer object bug demo program

When several processes attempt to use gem buffer objects simultaneously, data written to the buffers is occasionally lost. The attached C program, selftest.c, demonstrates the issue.

The program spawns 1 or more child processes, each of which runs exactly the same test sequence:

```
create tiled buffer object
loop 10 times
    map buffer to gtt
    fill buffer with unique data, and verify each write immediately (A)
    unmap buffer
    map buffer to gtt
    readback and verify data in buffer (B)
    unmap buffer
end loop
```

When run with more than about 20 test processes, the readback of written data occasionally fails. The chance of errors being introduced increases with the number of processes.

Note that the program verifies the data in two places: once as each DWord of data is written to the buffer (A), and again in a separate pass after the entire buffer has been filled (B). If an error is detected at (A) the child exits, so any errors reported at (B) imply that the data did originally write correctly to the buffer but has subsequently been lost.

Each process's data is unique, and is formed from the DWord offset in the lower 5 nibbles, followed by the loop counter in the 6th nibble, and finally the process instance in the upper byte.

In all failures I have seen to date, no cross-process corruption has occurred. The erroneous data always corresponds to data previously written by the SAME process, usually from a previous iteration of the loop.

Latest kernel tested: 3.8.0-RC6+ from git://people.freedesktop.org/~danvet/drm-intel (2013-03-07)
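Based on that description, here is a hedged sketch of how each verification word could be composed and checked; the helper names (make_word, check_buffer) are hypothetical, and the actual code is in the attached selftest.c:

```c
#include <stdint.h>
#include <stdio.h>

/* Compose one DWord of test data per the description above:
 * lower 5 nibbles = DWord offset, 6th nibble = loop counter,
 * upper byte = process instance. */
static uint32_t make_word(uint32_t dword_offset, uint32_t iteration,
                          uint32_t process_id)
{
	return (dword_offset & 0xfffff) |
	       ((iteration & 0xf) << 20) |
	       ((process_id & 0xff) << 24);
}

/* Verify pass (B): any mismatch here means the write already passed
 * the immediate check at (A), so the data was subsequently lost. */
static int check_buffer(volatile uint32_t *buf, uint32_t ndwords,
                        uint32_t iteration, uint32_t process_id)
{
	for (uint32_t i = 0; i < ndwords; i++) {
		uint32_t expect = make_word(i, iteration, process_id);
		if (buf[i] != expect) {
			fprintf(stderr, "mismatch at DWord %u: %08x != %08x\n",
			        i, buf[i], expect);
			return -1;
		}
	}
	return 0;
}
```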