Bug 94307

Summary: [BAT ILK] gem_sync/basic-bsd fails / hangcheck timer elapsed
Product: DRI Reporter: Tvrtko Ursulin <tvrtko.ursulin>
Component: DRM/IntelAssignee: Gabriel Feceoru <gabriel.feceoru>
Status: CLOSED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical    
Priority: high CC: huax.lu, intel-gfx-bugs
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: ILK i915 features: GEM/Other
Attachments:
Description Flags
Only start retire worker when idle none

Description Tvrtko Ursulin 2016-02-26 13:12:34 UTC
CI history suggests it is sporadic since CI_DRM_1062.

Results for igt@gem_sync@basic-bsd
Overview

Result: dmesg-fail

Back to summary
Details
Detail 	Value
Returncode 	99
Time 	0:00:10.400008
Stdout 	

IGT-Version: 1.13-gd627e30 (x86_64) (Linux: 4.5.0-rc5-gfxbench+ x86_64)
Completed 147082 cycles: 67.990 us
Stack trace:
  #0 [__igt_fail_assert+0x101]
  #1 [__real_main106+0x2f7]
  #2 [main+0x23]
  #3 [__libc_start_main+0xf0]
  #4 [_start+0x29]
Subtest basic-bsd: FAIL (10.004s)

Stderr 	

(gem_sync:5862) CRITICAL: Test assertion failure function sync_ring, file gem_sync.c:103:
(gem_sync:5862) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_sync:5862) CRITICAL: error: 4 != 0
Subtest basic-bsd failed.
**** DEBUG ****
(gem_sync:5862) ioctl-wrappers-DEBUG: Test requirement passed: gem_has_ring(fd, ring_id)
(gem_sync:5862) INFO: Completed 147082 cycles: 67.990 us
(gem_sync:5862) CRITICAL: Test assertion failure function sync_ring, file gem_sync.c:103:
(gem_sync:5862) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_sync:5862) CRITICAL: error: 4 != 0
****  END  ****

Environment 	

PIGLIT_SOURCE_DIR="/opt/igt/piglit" PIGLIT_PLATFORM="mixed_glx_egl"

Command 	/opt/igt/tests/gem_sync --run-subtest basic-bsd
dmesg 	

[  226.822964] [drm:i915_hangcheck_elapsed [i915]] *ERROR* Hangcheck timer elapsed... bsd ring idle
Comment 1 Chris Wilson 2016-02-26 20:19:39 UTC
History suggests that we never fixed it, the odd bug report filed here kept being closed as inconclusive.

It appears I have a ilk (x201s) machine that reasonably reliably reproduces it on -nightly but my breadcrumbs kernel seems clean. Promising.
Comment 2 Chris Wilson 2016-02-27 20:00:28 UTC
*** Bug 78681 has been marked as a duplicate of this bug. ***
Comment 3 Chris Wilson 2016-02-27 20:18:40 UTC
Created attachment 122009 [details] [review]
Only start retire worker when idle

The surprise winner.
Comment 4 Chris Wilson 2016-02-27 20:50:56 UTC
Just the error message is being suppressed and it barely affects the timings.
Comment 5 Chris Wilson 2016-02-27 22:11:04 UTC
static void
gen5_seqno_barrier(struct intel_engine_cs *ring)
{
        udelay(100);
}

incoherent MI_STORE* for the win.
Comment 6 Chris Wilson 2016-02-28 14:39:27 UTC
75us is as low as I can go and not start seeing missed seqno writes. An uncached mmio read + clflush + mfences on this machine is <500ns so the only explanation is that the GPU adds a significant buffer to the MI_STORE command that we cannot flush. It's also not documented whether poking SyncFlush flushes the internal MI_STORE buffer (seems unlikely).
Comment 7 Chris Wilson 2016-02-28 14:58:44 UTC
SyncFlush takes ~10-20us but does not wait for the MI_STORE to complete, i.e. the problem persists.
Comment 8 Tvrtko Ursulin 2016-03-03 17:42:04 UTC
Sighted on HSW igt@gem_storedw_loop@basic-blt as well now:

/archive/results/CI_IGT_test/Patchwork_1521/
Comment 9 Chris Wilson 2016-03-03 20:53:39 UTC
(In reply to Tvrtko Ursulin from comment #8)
> Sighted on HSW igt@gem_storedw_loop@basic-blt as well now:
> 
> /archive/results/CI_IGT_test/Patchwork_1521/

Likely related (hard to tell without actual access, there are some potential false positives in that hangcheck path as well) but since the hw is quite different, it would be best to keep them separate. The Ironlake bug is definitely that we don't apply a seqno-barrier for the BSD engine (but we do for the render path hence the discrepancy). Whereas for HSW, it is possible that the existing barrier is not long enough for that machine, or we have one of the false positives etc.
Comment 10 Daniela Prodan 2016-03-15 11:25:22 UTC
Since CI_DRM_1064, test gem_storedw_loop@basic-blt is passing on HSW [1]
On ILK, test gem_sync@basic-bsd still fails intermittently [2] in CI_DRM_1113 and CI_DRM_1125
On il5-i5-650 in Ro Jenkins, the gem_sync@basic-bsd test is passing for several builds, archive/results/CI_IGT_test/ilk1-i5-650.html

[1] /archive/results/CI_IGT_test/igt@gem_storedw_loop@basic-blt.html
[2] /archive/results/CI_IGT_test/igt@gem_sync@basic-bsd.html
Comment 11 Daniela Prodan 2016-03-15 11:27:51 UTC
assigning to Gabriel to investigate
Comment 13 Gabriel Feceoru 2016-03-21 16:28:09 UTC
Any tricks to reproduce this on a ILK?

I tried on 2 different ILKs (i5-650), using something like:

$for i in {1..100}; do sudo ./gem_sync --run-subtest basic-bsd; done

Both showed a 100% success ratio. The same with basic-all.
Comment 14 Chris Wilson 2016-03-22 11:11:19 UTC
It's a low level timing issue that will be dependent upon the GPU clock speeds, CPU clock speeds and the memory clocks -- i.e. different machines will have different error rates. What you can try is restricting the CPU to c0, disable rc6, or keep swapping machines until you find one that is easy to reproduce!
Comment 15 Imre Deak 2016-04-22 13:12:14 UTC
Another instance for the basic-each subtest:
http://benchsrv.fi.intel.com//archive/results/CI_IGT_test/Patchwork_1967/ilk-hp8440p/html/ilk-hp8440p@Patchwork_1967@1/igt@gem_sync@basic-each.html

IGT-Version: 1.14-gf650aa2 (x86_64) (Linux: 4.6.0-rc4-CI-Patchwork_1967+ x86_64)
bsd completed 49152 cycles: 203.635 us
render completed 142336 cycles: 70.493 us
Stack trace:
  #0 [__igt_fail_assert+0x101]
  #1 [sync_ring+0x301]
  #2 [__real_main204+0xe8]
  #3 [main+0x23]
  #4 [__libc_start_main+0xf0]
  #5 [_start+0x29]
Subtest basic-each: FAIL (10.075s)

(gem_sync:5959) CRITICAL: Test assertion failure function sync_ring, file gem_sync.c:138:
(gem_sync:5959) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_sync:5959) CRITICAL: error: 4 != 0
Subtest basic-each failed.
**** DEBUG ****
(gem_sync:5959) CRITICAL: Test assertion failure function sync_ring, file gem_sync.c:138:
(gem_sync:5959) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_sync:5959) CRITICAL: error: 4 != 0
****  END  ****
Comment 16 Jari Tahvanainen 2016-07-04 08:36:48 UTC
priority aligned for igt basic tests on gen7 to High+Critical
Comment 17 Chris Wilson 2016-07-05 20:10:29 UTC
Should be "fixed" now...

commit f8973c217f07903247d222ab92ad37e2529aff2e
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jul 1 17:23:21 2016 +0100

    drm/i915: Add a delay between interrupt and inspecting the final seqno (ilk)
Comment 18 yann 2016-10-14 07:30:47 UTC
Closing as fixed -confirmed passing at least since CI_DRM_1649, more than 2 weeks ago (CI_IGT_test/fi-ilk-m540.html & CI_IGT_test/fi-ilk-650.html)

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.