Bug 91298

Summary: [BDW] piglit@spec@arb_shader_atomic_counters@unused-result intermittent failures
Product: DRI Reporter: Dylan Baker <baker.dylan.c>
Component: DRM/IntelAssignee: Francisco Jerez <currojerez>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: medium CC: chris, currojerez, gary.c.wang, intel-gfx-bugs, lemody
Version: unspecified   
Hardware: Other   
OS: Linux (All)   
Whiteboard:
i915 platform: i915 features:
Bug Depends on:    
Bug Blocks: 93185    
Attachments:
Description Flags
gen7_flush_batch_dc_writes.patch none

Description Dylan Baker 2015-07-10 18:35:56 UTC
This test intermittently fails on broadwell, sometimes the output makes sense, sometimes it doesn't.

For example, this failure (This is a fail, not a pass):
/tmp/build_root/m64/lib/piglit/bin/arb_shader_atomic_counters-unused-result -auto -fbo
Probe value at (0)
  Expected: 0x00000001
  Observed: 0x00000001
Comment 1 Chris Wilson 2015-07-13 08:35:36 UTC
It's bad synchronisation of the read buffer:

diff --git a/tests/spec/arb_shader_atomic_counters/common.c b/tests/spec/arb_shader_atomic_counters/common.c
index 95d809e..bf8309d 100644
--- a/tests/spec/arb_shader_atomic_counters/common.c
+++ b/tests/spec/arb_shader_atomic_counters/common.c
@@ -35,6 +35,7 @@ atomic_counters_probe_buffer(unsigned base, unsigned count,
         uint32_t *p = glMapBufferRange(
                 GL_ATOMIC_COUNTER_BUFFER, base * sizeof(uint32_t),
                 count * sizeof(uint32_t), GL_MAP_READ_BIT);
+        bool pass = true;
         unsigned i;
 
         if (!p) {
@@ -43,17 +44,18 @@ atomic_counters_probe_buffer(unsigned base, unsigned count,
         }
 
         for (i = 0; i < count; ++i) {
-                if (p[i] != expected[i]) {
+                uint32_t *found = p[i];
+                if (found != expected[i]) {
                         printf("Probe value at (%i)\n", i);
                         printf("  Expected: 0x%08x\n", expected[i]);
-                        printf("  Observed: 0x%08x\n", p[i]);
-                        glUnmapBuffer(GL_ATOMIC_COUNTER_BUFFER);
-                        return false;
+                        printf("  Observed: 0x%08x\n", found);
+                        pass = false;
+                        break;
                 }
         }
 
         glUnmapBuffer(GL_ATOMIC_COUNTER_BUFFER);
-        return true;
+        return pass;
 }
 
 bool

should fix up the error message.
Comment 2 Chris Wilson 2015-07-13 08:36:45 UTC
s/*found/found/ ofc
Comment 3 Kenneth Graunke 2015-10-21 08:14:11 UTC
I sent ickle's patch to the mailing list, and fixed the same bug in shader_runner and a compute shader test.

However, this just fixes the test failure message to make sense.

The driver still has a race condition that is causing these intermittent failures, and we need to fix that.
Comment 4 Kenneth Graunke 2015-11-07 05:40:23 UTC
It definitely seems to be some kind of race - adding usleep(1000) in the test inbetween glMapBufferRange and the p[i] access effectively hides the problem.

It also works fine on Haswell - I've run it for a few hours with no failures.

INTEL_DEBUG=sync doesn't help.  always_flush_batch=true seems to make it fail a little more often...it at minimum doesn't help...
Comment 5 Kenneth Graunke 2015-11-08 01:16:30 UTC
Chris, by chance do you have any ideas about this bug?

I'm fairly stumped...
Comment 6 Mark Janes 2015-11-12 00:23:27 UTC
piglit.spec.arb_fragment_layer_viewport.layer-gs-writes-in-range depends on atomic counters and suffers from this bug.
Comment 7 Tapani Pälli 2015-11-30 07:39:05 UTC
I've sent a test to the mailing list that might be related to this same issue. Adding syncs or sleeps does not seem to help there though.

http://lists.freedesktop.org/archives/piglit/2015-November/018227.html
Comment 8 Ian Romanick 2015-12-02 00:52:02 UTC
As another maddening data point... I've never been able to reproduce this on my ThinkPad X250.  I had thought it could be a kernel regression (I was on a quite old kernel), but I just ran the test 100 times in a loop without a single failure on 4.2.6-301.fc23.x86_64.

Perhaps it's only a problem with GBM runs?  I don't use GBM.
Comment 9 Francisco Jerez 2015-12-09 16:12:05 UTC
I've been trying to reproduce this failure on BDW today without success.  It doesn't seem to make a difference for me whether I use X or GBM to run the test case.

I noticed though that the DRM render ring code is missing a DC flush on batchbuffer submission, what could lead to a non-deterministic failure in this test case.  Can anyone try out the attached kernel patch and see if it helps?  If it does this should be considered a DRM bug.
Comment 10 Francisco Jerez 2015-12-09 16:12:54 UTC
Created attachment 120435 [details] [review]
gen7_flush_batch_dc_writes.patch
Comment 11 Mark Janes 2016-01-07 20:29:53 UTC
Curro's patch fixes the tests that are intermittent.

However, upgrading to Linux 4.3 (to apply the patch) produces an intermitten failures on piglit.spec.arb_shader_image_load_store.qualifiers.  I'll write that bug up.
Comment 12 Francisco Jerez 2016-01-14 03:03:31 UTC
(In reply to Mark Janes from comment #11)
> Curro's patch fixes the tests that are intermittent.
> 
> However, upgrading to Linux 4.3 (to apply the patch) produces an intermitten
> failures on piglit.spec.arb_shader_image_load_store.qualifiers.  I'll write
> that bug up.

Thanks, I've added your tested-by and sent the patch for review.
Comment 13 Jani Nikula 2016-01-15 09:53:00 UTC
Fix pushed to drm-intel-next-queued as

commit 965fd602a6436f689f4f2fe40a6789582778ccd5
Author: Francisco Jerez <currojerez@riseup.net>
Date:   Wed Jan 13 18:59:39 2016 -0800

    drm/i915: Make sure DC writes are coherent on flush.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.