Bug 99816 - [IVB][BAT] gem_sync/basic-store-all FAIL on CI
Summary: [IVB][BAT] gem_sync/basic-store-all FAIL on CI
Status: CLOSED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2017-02-14 19:54 UTC by Jani Saarinen
Modified: 2017-02-20 08:48 UTC

See Also:
i915 platform: IVB
i915 features: GEM/Other


Attachments
Use sync flush polling for the irq seqno barrier (2.37 KB, patch)
2017-02-15 09:13 UTC, Chris Wilson

Description Jani Saarinen 2017-02-14 19:54:23 UTC
Test gem_sync/basic-store-all failed on CI
pass       -> FAIL       (fi-ivb-3770)

Result and logs:

Out	
IGT-Version: 1.17-gca2ba47 (x86_64) (Linux: 4.10.0-rc8-CI-Patchwork_3809+ x86_64)
Using Legacy submission , with semaphores
Completed 12288 cycles: 419.333 us
Stack trace:
  #0 [__igt_fail_assert+0x101]
  #1 [store_all+0x515]
  #2 [<unknown>+0x515]
Subtest basic-store-all: FAIL (5.393s)
Err	
(gem_sync:8385) CRITICAL: Test assertion failure function store_all, file gem_sync.c:690:
(gem_sync:8385) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_sync:8385) CRITICAL: error: 1 != 0
Subtest basic-store-all failed.
**** DEBUG ****
(gem_sync:8385) DEBUG: Test requirement passed: num_engines
(gem_sync:8385) CRITICAL: Test assertion failure function store_all, file gem_sync.c:690:
(gem_sync:8385) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_sync:8385) CRITICAL: error: 1 != 0
****  END  ****
Environment	
PIGLIT_SOURCE_DIR="/opt/igt/piglit" PIGLIT_PLATFORM="mixed_glx_egl"

Dmesg:
https://intel-gfx-ci.01.org/CI/Patchwork_3809/fi-ivb-3770/dmesg-during.log

This test has failed the same way once before:
https://intel-gfx-ci.01.org/CI/CI_DRM_2106/fi-ivb-3770/igt@gem_sync@basic-store-all.html

Out	
IGT-Version: 1.17-gfa45f58 (x86_64) (Linux: 4.10.0-rc3-CI-CI_DRM_2106+ x86_64)
Using Legacy submission , with semaphores
Completed 15360 cycles: 348.941 us
Stack trace:
  #0 [__igt_fail_assert+0x101]
  #1 [store_all+0x515]
  #2 [<unknown>+0x515]
Subtest basic-store-all: FAIL (5.447s)
Err	
(gem_sync:8230) CRITICAL: Test assertion failure function store_all, file gem_sync.c:690:
(gem_sync:8230) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_sync:8230) CRITICAL: error: 1 != 0
Subtest basic-store-all failed.
**** DEBUG ****
(gem_sync:8230) DEBUG: Test requirement passed: num_engines
(gem_sync:8230) CRITICAL: Test assertion failure function store_all, file gem_sync.c:690:
(gem_sync:8230) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_sync:8230) CRITICAL: error: 1 != 0
****  END  ****
Comment 1 Chris Wilson 2017-02-14 21:33:14 UTC
The test is doing its job and hitting the error it is hunting for. Pretty much the only way to prevent it is by increasing the delay between receiving the interrupt and checking the seqno. At the moment that delay is defined by an uncached mmio - but maybe we should try setting SyncFlush and polling until clear?
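
For readers unfamiliar with the idea, here is a minimal sketch of a "set a flush bit and poll until it clears" barrier, run against a simulated engine. It is not the i915 code: the thread does not describe the actual SyncFlush register semantics, so the `FakeEngine` class, its methods, and the `max_polls` limit are all hypothetical names chosen for illustration.

```python
class FakeEngine:
    """Simulated engine (hypothetical): a seqno write becomes visible
    only once a previously requested flush has completed."""

    def __init__(self, reads_until_flushed):
        self._reads_left = reads_until_flushed
        self._flush_pending = False
        self._pending_seqno = 0
        self.visible_seqno = 0

    def complete_request(self, seqno):
        # The interrupt has fired, but the seqno write is still in flight.
        self._pending_seqno = seqno

    def set_sync_flush(self):
        # Ask the simulated hardware to flush outstanding writes.
        self._flush_pending = True

    def sync_flush_pending(self):
        # Each call models one register poll; the bit clears after enough polls.
        if self._flush_pending:
            self._reads_left -= 1
            if self._reads_left <= 0:
                self.visible_seqno = self._pending_seqno
                self._flush_pending = False
        return self._flush_pending


def seqno_barrier(engine, max_polls=1000):
    """Request a flush, then poll until it clears.
    Returns True if the flush completed within max_polls polls."""
    engine.set_sync_flush()
    for _ in range(max_polls):
        if not engine.sync_flush_pending():
            return True
    return False


engine = FakeEngine(reads_until_flushed=5)
engine.complete_request(seqno=42)
stale = engine.visible_seqno       # read straight after the "interrupt"
flushed = seqno_barrier(engine)    # wait for the flush bit to clear
fresh = engine.visible_seqno       # read after the barrier
```

In this toy model the read taken straight after the interrupt still sees the old seqno, which is the situation the test reports as a missed interrupt. The read taken after the barrier sees the new one. The bounded poll count is a design choice of the sketch, so that a flush which never completes cannot spin forever.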
Comment 2 Chris Wilson 2017-02-15 09:01:48 UTC
Do we have an equivalent machine in farm2? Could you set it running gem_sync (the full set) and see how reproducible the missed interrupt is?
Comment 3 Jani Saarinen 2017-02-15 09:12:45 UTC
Not exactly but almost (4770s)
Comment 4 Chris Wilson 2017-02-15 09:13:23 UTC
Created attachment 129620
Use sync flush polling for the irq seqno barrier

Back to something super heavyweight.
Comment 5 Jani Saarinen 2017-02-15 09:56:18 UTC
Correction: I had the wrong HW in mind. The same IVB 3770 is there.
Comment 6 Tomi Sarvela 2017-02-15 10:17:13 UTC
There's an IVB-3770 on farm2, a Dell Optiplex (vs. the HP Pro on farm1). It doesn't seem to have had any gem_sync failures in the last 100 runs.
Comment 7 Jani Saarinen 2017-02-16 08:44:57 UTC
Chris, can you try this on trybot? As it stands, it is hard to know whether this helps.
Comment 8 Chris Wilson 2017-02-16 08:51:43 UTC
The failure is so rare that I don't expect trybot to give a clear indication of whether it is sufficient to prevent the missed interrupt. As it happens, SyncFlush does not work well with hanging batches, so I'm trying a different approach.
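
To put the "too rare for a clear indication" point in numbers, here is a rough sketch that treats each CI run as an independent trial. The thread gives no measured failure rate, so the per-run rate `p` below is a purely hypothetical value, and the function names are made up for this example.

```python
import math


def prob_no_failure(p, n):
    """Probability that n independent runs all pass when each run
    fails with probability p."""
    return (1.0 - p) ** n


def runs_needed(p, confidence):
    """Smallest number of runs n for which at least one failure is
    expected to show up with the given confidence, i.e. (1 - p)**n <= 1 - confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


p = 0.001  # hypothetical per-run failure rate, NOT taken from the thread
clean_100 = prob_no_failure(p, 100)   # chance that 100 runs in a row pass
needed_95 = runs_needed(p, 0.95)      # runs needed to see a failure with 95% confidence
print(f"P(100 clean runs) = {clean_100:.3f}, runs for 95% confidence = {needed_95}")
```

At a rate this low, a clean streak of 100 runs is the expected outcome with or without a fix, so a handful of trybot runs cannot tell the two cases apart.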
Comment 9 Chris Wilson 2017-02-17 21:59:14 UTC
commit 8998567b51141f79309d1267640c919dfd23d3a4
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Feb 17 15:13:02 2017 +0000

    drm/i915: Defer declaration of missed-interrupt until the waiter is asleep

and earlier should indirectly help, and I expect to reduce the frequency of false positives. Marking as closed until we see it again.
Comment 10 Martin Peres 2017-02-20 08:48:57 UTC
(In reply to Chris Wilson from comment #9)
> commit 8998567b51141f79309d1267640c919dfd23d3a4
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Fri Feb 17 15:13:02 2017 +0000
> 
>     drm/i915: Defer declaration of missed-interrupt until the waiter is
> asleep
> 
> and earlier should indirectly help, and I expect to reduce the frequency of
> false positives. Marking as closed until we see it again.

OK, will archive the temporary blacklist. However, for very intermittent failures like this, it would be nice if we could land a patch that would improve the debug-ability of the issue, so that next time we see it, the CI system would give us meaningful information about the bug and help us improve our code.

