94307 – [BAT ILK] gem_sync/basic-bsd fails / hangcheck timer elapsed

Bug 94307 - [BAT ILK] gem_sync/basic-bsd fails / hangcheck timer elapsed

Summary: [BAT ILK] gem_sync/basic-bsd fails / hangcheck timer elapsed

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel (show other bugs)
Version:	DRI git
Hardware:	Other All

Importance:	high critical
Assignee:	Gabriel Feceoru
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:
Keywords:

Duplicates (1):	78681 (view as bug list)
Depends on:
Blocks:

Reported:	2016-02-26 13:12 UTC by Tvrtko Ursulin
Modified:	2016-10-14 07:30 UTC (History)
CC List:	2 users (show)

See Also:
i915 platform:	ILK
i915 features:	GEM/Other

Attachments
Only start retire worker when idle (9.38 KB, patch) 2016-02-27 20:18 UTC, Chris Wilson	no flags	Details \| Splinter Review
View All

Description Tvrtko Ursulin 2016-02-26 13:12:34 UTC

CI history suggests it is sporadic since CI_DRM_1062.

Results for igt@gem_sync@basic-bsd
Overview

Result: dmesg-fail

Back to summary
Details
Detail 	Value
Returncode 	99
Time 	0:00:10.400008
Stdout 	

IGT-Version: 1.13-gd627e30 (x86_64) (Linux: 4.5.0-rc5-gfxbench+ x86_64)
Completed 147082 cycles: 67.990 us
Stack trace:
  #0 [__igt_fail_assert+0x101]
  #1 [__real_main106+0x2f7]
  #2 [main+0x23]
  #3 [__libc_start_main+0xf0]
  #4 [_start+0x29]
Subtest basic-bsd: FAIL (10.004s)

Stderr 	

(gem_sync:5862) CRITICAL: Test assertion failure function sync_ring, file gem_sync.c:103:
(gem_sync:5862) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_sync:5862) CRITICAL: error: 4 != 0
Subtest basic-bsd failed.
**** DEBUG ****
(gem_sync:5862) ioctl-wrappers-DEBUG: Test requirement passed: gem_has_ring(fd, ring_id)
(gem_sync:5862) INFO: Completed 147082 cycles: 67.990 us
(gem_sync:5862) CRITICAL: Test assertion failure function sync_ring, file gem_sync.c:103:
(gem_sync:5862) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_sync:5862) CRITICAL: error: 4 != 0
****  END  ****

Environment 	

PIGLIT_SOURCE_DIR="/opt/igt/piglit" PIGLIT_PLATFORM="mixed_glx_egl"

Command 	/opt/igt/tests/gem_sync --run-subtest basic-bsd
dmesg 	

[  226.822964] [drm:i915_hangcheck_elapsed [i915]] *ERROR* Hangcheck timer elapsed... bsd ring idle

Comment 1 Chris Wilson 2016-02-26 20:19:39 UTC

History suggests that we never fixed it, the odd bug report filed here kept being closed as inconclusive.

It appears I have a ilk (x201s) machine that reasonably reliably reproduces it on -nightly but my breadcrumbs kernel seems clean. Promising.

Comment 2 Chris Wilson 2016-02-27 20:00:28 UTC

*** Bug 78681 has been marked as a duplicate of this bug. ***

Comment 3 Chris Wilson 2016-02-27 20:18:40 UTC

Created attachment 122009 [details] [review]
Only start retire worker when idle

The surprise winner.

Comment 4 Chris Wilson 2016-02-27 20:50:56 UTC

Just the error message is being suppressed and it barely affects the timings.

Comment 5 Chris Wilson 2016-02-27 22:11:04 UTC

static void
gen5_seqno_barrier(struct intel_engine_cs *ring)
{
        udelay(100);
}

incoherent MI_STORE* for the win.

Comment 6 Chris Wilson 2016-02-28 14:39:27 UTC

75us is as low as I can go and not start seeing missed seqno writes. An uncached mmio read + clflush + mfences on this machine is <500ns so the only explanation is that the GPU adds a significant buffer to the MI_STORE command that we cannot flush. It's also not documented whether poking SyncFlush flushes the internal MI_STORE buffer (seems unlikely).

Comment 7 Chris Wilson 2016-02-28 14:58:44 UTC

SyncFlush takes ~10-20us but does not wait for the MI_STORE to complete, i.e. the problem persists.

Comment 8 Tvrtko Ursulin 2016-03-03 17:42:04 UTC

Sighted on HSW igt@gem_storedw_loop@basic-blt as well now:

/archive/results/CI_IGT_test/Patchwork_1521/

Comment 9 Chris Wilson 2016-03-03 20:53:39 UTC

(In reply to Tvrtko Ursulin from comment #8)
> Sighted on HSW igt@gem_storedw_loop@basic-blt as well now:
> 
> /archive/results/CI_IGT_test/Patchwork_1521/

Likely related (hard to tell without actual access, there are some potential false positives in that hangcheck path as well) but since the hw is quite different, it would be best to keep them separate. The Ironlake bug is definitely that we don't apply a seqno-barrier for the BSD engine (but we do for the render path hence the discrepancy). Whereas for HSW, it is possible that the existing barrier is not long enough for that machine, or we have one of the false positives etc.

Comment 10 Daniela Prodan 2016-03-15 11:25:22 UTC

Since CI_DRM_1064, test gem_storedw_loop@basic-blt is passing on HSW [1]
On ILK, test gem_sync@basic-bsd still fails intermittently [2] in CI_DRM_1113 and CI_DRM_1125
On il5-i5-650 in Ro Jenkins, the gem_sync@basic-bsd test is passing for several builds, archive/results/CI_IGT_test/ilk1-i5-650.html

[1] /archive/results/CI_IGT_test/igt@gem_storedw_loop@basic-blt.html
[2] /archive/results/CI_IGT_test/igt@gem_sync@basic-bsd.html

Comment 11 Daniela Prodan 2016-03-15 11:27:51 UTC

assigning to Gabriel to investigate

Comment 12 Imre Deak 2016-03-16 15:25:58 UTC

The following could be related:
http://benchsrv.fi.intel.com/archive/results/CI_IGT_test/Patchwork_1535/ilk-hp8440p/html/ilk-hp8440p@Patchwork_1535@1/igt@gem_sync@basic-all.html

Comment 13 Gabriel Feceoru 2016-03-21 16:28:09 UTC

Any tricks to reproduce this on a ILK?

I tried on 2 different ILKs (i5-650), using something like:

$for i in {1..100}; do sudo ./gem_sync --run-subtest basic-bsd; done

Both showed a 100% success ratio. The same with basic-all.

Comment 14 Chris Wilson 2016-03-22 11:11:19 UTC

It's a low level timing issue that will be dependent upon the GPU clock speeds, CPU clock speeds and the memory clocks -- i.e. different machines will have different error rates. What you can try is restricting the CPU to c0, disable rc6, or keep swapping machines until you find one that is easy to reproduce!

Comment 15 Imre Deak 2016-04-22 13:12:14 UTC

Another instance for the basic-each subtest:
http://benchsrv.fi.intel.com//archive/results/CI_IGT_test/Patchwork_1967/ilk-hp8440p/html/ilk-hp8440p@Patchwork_1967@1/igt@gem_sync@basic-each.html

IGT-Version: 1.14-gf650aa2 (x86_64) (Linux: 4.6.0-rc4-CI-Patchwork_1967+ x86_64)
bsd completed 49152 cycles: 203.635 us
render completed 142336 cycles: 70.493 us
Stack trace:
  #0 [__igt_fail_assert+0x101]
  #1 [sync_ring+0x301]
  #2 [__real_main204+0xe8]
  #3 [main+0x23]
  #4 [__libc_start_main+0xf0]
  #5 [_start+0x29]
Subtest basic-each: FAIL (10.075s)

(gem_sync:5959) CRITICAL: Test assertion failure function sync_ring, file gem_sync.c:138:
(gem_sync:5959) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_sync:5959) CRITICAL: error: 4 != 0
Subtest basic-each failed.
**** DEBUG ****
(gem_sync:5959) CRITICAL: Test assertion failure function sync_ring, file gem_sync.c:138:
(gem_sync:5959) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_sync:5959) CRITICAL: error: 4 != 0
****  END  ****

Comment 16 Jari Tahvanainen 2016-07-04 08:36:48 UTC

priority aligned for igt basic tests on gen7 to High+Critical

Comment 17 Chris Wilson 2016-07-05 20:10:29 UTC

Should be "fixed" now...

commit f8973c217f07903247d222ab92ad37e2529aff2e
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jul 1 17:23:21 2016 +0100

    drm/i915: Add a delay between interrupt and inspecting the final seqno (ilk)

Comment 18 yann 2016-10-14 07:30:47 UTC

Closing as fixed -confirmed passing at least since CI_DRM_1649, more than 2 weeks ago (CI_IGT_test/fi-ilk-m540.html & CI_IGT_test/fi-ilk-650.html)

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.