Bug 95372 - [BAT BYT] Sporadic failure from igt/gem_exec_flush@basic-batch-kernel-default-cmd
Summary: [BAT BYT] Sporadic failure from igt/gem_exec_flush@basic-batch-kernel-default...
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: All Linux (All)
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2016-05-12 22:41 UTC by Matt Roper
Modified: 2017-07-24 22:41 UTC

See Also:
i915 platform: BYT
i915 features: GEM/Other


Attachments
dmesg fi-byt-n2820 (4.46 MB, text/plain)
2016-05-13 12:25 UTC, Daniela Prodan
dmesg ro-byt-n2820 (4.03 MB, text/plain)
2016-05-13 12:25 UTC, Daniela Prodan

Description Matt Roper 2016-05-12 22:41:33 UTC
This test fails sporadically with a couple of different errors.  The most common one is:

  (gem_exec_flush:6041) ioctl-wrappers-CRITICAL: Test assertion failure function gem_execbuf, file ioctl_wrappers.c:589:
  (gem_exec_flush:6041) ioctl-wrappers-CRITICAL: Failed assertion: __gem_execbuf(fd, execbuf) == 0
  (gem_exec_flush:6041) ioctl-wrappers-CRITICAL: error: -22 != 0

But looking through the CI history, it appears there's also sometimes:

  (gem_exec_flush:6131) CRITICAL: Test assertion failure function batch, file gem_exec_flush.c:456:
  (gem_exec_flush:6131) CRITICAL: Failed assertion: map[i] == cycles + i
  (gem_exec_flush:6131) CRITICAL: error: 0xabcdabcd != 0x3

CI history:
  /archive/results/CI_IGT_test/igt@gem_exec_flush@basic-batch-kernel-default-cmd.html
Comment 1 Chris Wilson 2016-05-13 10:15:07 UTC
The only problem here is the sporadic failure - and that is mostly due to the overhead of the CI kernels hiding the issue. Since we are under severe time constraints for BAT, making the tests longer to improve detection rates is also problematic. Stuck between a rock and a hard place!
Comment 2 Daniela Prodan 2016-05-13 12:24:39 UTC
This test fails quite often on BYT:

/archive/results/CI_IGT_test/RO_CI_DRM_365/fi-byt-n2820/html/fi-byt-n2820@RO_CI_DRM_365@1/igt@gem_exec_flush@basic-batch-kernel-default-cmd.html

/archive/results/CI_IGT_test/RO_CI_DRM_365/ro-byt-n2820/html/ro-byt-n2820@RO_CI_DRM_365@1/igt@gem_exec_flush@basic-batch-kernel-default-cmd.html

Also attaching dmesg logs.
Comment 3 Daniela Prodan 2016-05-13 12:25:13 UTC
Created attachment 123671
dmesg fi-byt-n2820
Comment 4 Daniela Prodan 2016-05-13 12:25:48 UTC
Created attachment 123672
dmesg ro-byt-n2820
Comment 6 Dave Gordon 2016-06-07 07:27:01 UTC
The second of the dmesg logs that Daniela posted contains the line:

[  313.349534] [drm:i915_parse_cmds] CMD: Command length exceeds batch length: 0x7FDEE770 length=114 batchlen=4

I can't see anywhere in the i-g-t tests that submits such a batch. Firstly, the length is not a multiple of 8, whereas we normally pad batches to an even number of DWords; secondly, that hex number doesn't appear to be a valid instruction.

Is the parser perhaps picking up undefined data? That would explain why we see these failures only on BYT, and only intermittently.

.Dave.
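
To make the two observations above concrete, here is a minimal sketch of the sanity checks being described. The helper names are hypothetical and this is not the i915 command parser; it only assumes that batches are padded to a multiple of 8 bytes and that a command must not run past what remains of the batch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (not i915 code): a batch padded to an even number
 * of DWords has a length that is a multiple of 8. */
static inline bool len_is_padded(uint32_t len)
{
    return len % 8 == 0;
}

/* Hypothetical helper (not i915 code): a command fits only if its length
 * does not exceed what remains of the batch, in the same units. */
static inline bool cmd_fits_in_batch(uint32_t cmd_len, uint32_t remaining)
{
    return cmd_len <= remaining;
}

/* The values from the dmesg line above: length=114, batchlen=4. */
static inline void check_logged_values(void)
{
    assert(!len_is_padded(114));
    assert(!cmd_fits_in_batch(114, 4));
}
```

With the logged values both checks fail, which is consistent with the parser having decoded a length from data that was never a real command header.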
Comment 7 Chris Wilson 2016-06-07 08:11:36 UTC
Yes. For the cmdparser there are 2 sources of incoherency: first, writes from the CPU cache to memory are not being ordered with mfence; clflush; mfence, and second, writes through the GTT are not immediately coherent. More details, ideas, and patches are on the mailing list from last year and in other bugs that are even older.
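
For readers unfamiliar with the mfence; clflush; mfence ordering mentioned above, the following is a userspace illustration only, using the SSE2 intrinsics from `<emmintrin.h>`. It is not the kernel's code path. The function name `flush_range` and the 64-byte cache-line size are assumptions made for the sketch.

```c
#include <assert.h>
#include <emmintrin.h>  /* _mm_mfence, _mm_clflush (SSE2, x86 only) */
#include <stddef.h>
#include <stdint.h>

#define CACHELINE 64    /* assumed cache-line size */

/* Hypothetical helper: flush every cache line covering [addr, addr + len),
 * fenced on both sides so earlier stores are ordered before the flushes
 * and the flushes are ordered before later loads and stores. */
static inline void flush_range(const void *addr, size_t len)
{
    uintptr_t p = (uintptr_t)addr & ~(uintptr_t)(CACHELINE - 1);
    uintptr_t end = (uintptr_t)addr + len;

    assert(addr != NULL || len == 0);

    _mm_mfence();
    for (; p < end; p += CACHELINE)
        _mm_clflush((const void *)p);
    _mm_mfence();
}
```

Flushing does not change the data as seen by the CPU; it only forces dirty lines out to memory. Note that the fix quoted in comment 8 attributes the GTT half of the problem to timing rather than to a missing flush.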
Comment 8 Chris Wilson 2016-08-19 09:29:01 UTC
commit 3b5724d702ef24ee41ca008a1fab1cf94f3d31b5
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Aug 18 17:16:49 2016 +0100

    drm/i915: Wait for writes through the GTT to land before reading back
    
    If we quickly switch from writing through the GTT to a read of the
    physical page directly with the CPU (e.g. performing relocations through
    the GTT and then running the command parser), we can observe that the
    writes are not visible to the CPU. It is not a coherency problem, as
    extensive investigations with clflush have demonstrated, but a mere
    timing issue - we have to wait for the GTT to complete it's write before
    we start our read from the CPU.

