Bug 100007 - [BAT][SHARDS] igt@gem_exec_(await@wide-all|await@wide-context|store@cachelines-bsd) - fail - Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
Summary: [BAT][SHARDS] igt@gem_exec_(await@wide-all|await@wide-context|store@cacheline...
Status: RESOLVED INVALID
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: unspecified
Hardware: Other All
: medium critical
Assignee: Francesco Balestrieri
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
: 102405 (view as bug list)
Depends on:
Blocks:
 
Reported: 2017-02-28 13:45 UTC by Ander Conselvan de Oliveira
Modified: 2018-12-05 10:06 UTC (History)
3 users (show)

See Also:
i915 platform: ALL
i915 features: GEM/Other


Attachments

Note You need to log in before you can comment on or make changes to this bug.
Description Ander Conselvan de Oliveira 2017-02-28 13:45:46 UTC
Out:
	
IGT-Version: 1.17-ga9dcd63 (x86_64) (Linux: 4.10.0-CI-CI_DRM_2251+ x86_64)
Has LLC? yes
CPU features: x86-64, sse2, sse3, ssse3, sse4.1, sse4.2, avx
Child[3]: 10240 cycles
Child[5]: 21504 cycles
Child[1]: 21504 cycles
Child[4]: 21504 cycles
Child[6]: 21504 cycles
Child[0]: 21504 cycles
Child[2]: 21504 cycles
Child[7]: 21504 cycles
Stack trace:
  #0 [__igt_fail_assert+0x101]
  #1 [batch+0x61b]
  #2 [<unknown>+0x61b]
Subtest basic-batch-kernel-default-uc: FAIL (5.637s)
Test requirement not met in function __real_main549, file gem_exec_flush.c:609:
Test requirement: !(gen == 6 && e->exec_id == I915_EXEC_BSD)
MI_STORE_DATA broken on gen6 bsd
Test requirement not met in function gem_require_ring, file ioctl_wrappers.c:1549:
Test requirement: gem_has_ring(fd, ring)
Test requirement not met in function gem_require_ring, file ioctl_wrappers.c:1549:
Test requirement: gem_has_ring(fd, ring)
Test requirement not met in function gem_require_ring, file ioctl_wrappers.c:1549:
Test requirement: gem_has_ring(fd, ring)

Err:

(gem_exec_flush:7130) CRITICAL: Test assertion failure function batch, file gem_exec_flush.c:536:
(gem_exec_flush:7130) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_exec_flush:7130) CRITICAL: error: 1 != 0
Subtest basic-batch-kernel-default-uc failed.
**** DEBUG ****
(gem_exec_flush:7130) CRITICAL: Test assertion failure function batch, file gem_exec_flush.c:536:
(gem_exec_flush:7130) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_exec_flush:7130) CRITICAL: error: 1 != 0
****  END  ****


https://intel-gfx-ci.01.org/CI/CI_DRM_2249/fi-snb-2600/igt@gem_exec_flush@basic-batch-kernel-default-uc.html

https://intel-gfx-ci.01.org/CI/CI_DRM_2251/fi-snb-2600/igt@gem_exec_flush@basic-batch-kernel-default-uc.html
Comment 1 Chris Wilson 2017-02-28 15:22:25 UTC
In essence, it is a timing issue between raising the interrupt and processing the breadcrumb. Note sure just why

commit 67b807a892302d184f6987b0c3cc2dc2505e4a2d
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Feb 27 20:58:50 2017 +0000

    drm/i915: Delay disabling the user interrupt for breadcrumbs

...

commit 8d769ea7bc16c34c9dc5143be021e943014c4cd1
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Feb 27 20:58:47 2017 +0000

    drm/i915: Report both waiters and success from intel_engine_wakeup()

would have had a significant (bad) impact on the irq-seqno timings, or why it is only showing up in a single test - though we haven't had that many runs yet.

Usual test is the full gem_sync to see how regular the error is.
Comment 2 Chris Wilson 2017-03-01 17:47:10 UTC
Been trying to reproduce with a
processor	: 3
vendor_id	: GenuineIntel
cpu family	: 6
model		: 42
model name	: Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz
stepping	: 7
microcode	: 0x1a
to no avail. So far it has only occurred on one CI snb as well.
Comment 3 Chris Wilson 2017-03-02 07:43:20 UTC
Also checking with a 
vendor_id	: GenuineIntel
cpu family	: 6
model		: 42
model name	: Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz
stepping	: 7
microcode	: 0x29
and no hits.
Comment 4 Chris Wilson 2017-03-03 20:57:48 UTC
Applied
commit b66255f0f77902ef41b09163a6a092d2d905e151
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Mar 3 17:14:22 2017 +0000

    drm/i915: Refactor wakeup of the next breadcrumb waiter
    
    Refactor the common task of updating the first_waiter, serialised with
    the interrupt handler. When we update the first_waiter, we also need to
    wakeup the new bottom-half in order to complete the actions that we may
    have delegated to it (such as checking the irq-seqno coherency or waking
    up other lower priority concurrent waiters).
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/20170303171422.4735-1-chris@chris-wilson.co.uk

in an attempt to curtail some missed wakeups.
Comment 5 Martin Peres 2017-03-20 11:42:42 UTC
Still very much visible, it happened on CI_DRM_2368-CI_DRM_2370.
Comment 7 Martin Peres 2017-03-23 09:59:17 UTC
Failure rate 45/136 run(s) (33%)

Raising the priority!
Comment 8 Chris Wilson 2017-03-27 15:06:23 UTC
 gen6_seqno_barrier(struct intel_engine_cs *engine)
 {
        struct drm_i915_private *dev_priv = engine->i915;
+       int loop;
 
        /* Workaround to force correct ordering between irq and seqno writes on
         * ivb (and maybe also on snb) by reading from a CS register (like
@@ -937,7 +938,8 @@ gen6_seqno_barrier(struct intel_engine_cs *engine)
         * take the spinlock to guard against concurrent cacheline access.
         */
        spin_lock_irq(&dev_priv->uncore.lock);
-       POSTING_READ_FW(RING_ACTHD(engine->mmio_base));
+       for (loop = 0; loop < 3; loop++)
+               POSTING_READ_FW(RING_ACTHD(engine->mmio_base));
        spin_unlock_irq(&dev_priv->uncore.lock);

was a bust.
Comment 9 yann 2017-03-27 16:37:04 UTC
Adding reference to Chris' patch: https://patchwork.freedesktop.org/series/21876/
Comment 10 Martin Peres 2017-03-30 11:09:03 UTC
Seems like the above patch is not sufficient:
"Second run and the miss came back. Boo.
-Chris"

Thread: http://www.spinics.net/lists/intel-gfx/msg124328.html
Comment 11 Jani Saarinen 2017-03-31 10:01:49 UTC
Yep, keeps happening. 
Statistics: Failure rate 60/196 run(s) (30%)
Comment 12 Mika Kuoppala 2017-04-24 08:29:49 UTC
00:02.0 0300: 8086:0102 (rev 09) has survived 70h without failure.
Comment 13 Ricardo 2017-05-09 16:44:08 UTC
Adding tag into "Whiteboard" field - ReadyForDev
The bug still active
*Status is correct
*Platform is included
*Feature is included
*Priority and Severity correctly set
Comment 14 Jani Saarinen 2017-05-19 11:45:21 UTC
Has been seen on pw: https://patchwork.freedesktop.org/series/21868/
but not after that. We will follow.
Comment 15 Jani Saarinen 2017-05-19 15:52:07 UTC
Happened now on ci on CI_DRM_2638
Comment 16 Jani Saarinen 2017-05-29 08:31:56 UTC
Statistics: Failure rate 18/123 run(s) (14%)
Comment 17 Elizabeth 2017-06-27 20:54:40 UTC
Hello, 
Seems that this bug is going to need more time to fix. Changing status to NEEDINFO. Once provided more information, please change to REOPEN. Thank you.
Comment 18 Mika Kuoppala 2017-07-05 12:28:11 UTC
Me and Tomi Sarvela have been trying quite extensively to reproduce this outside of CI. And without success.

There is something inherently tied to CI environment that we haven't
yet figured out.
Comment 19 Martin Peres 2017-07-18 08:14:04 UTC
Also seen on fi-ivb-3520m, on igt@gem_sync@basic-store-all.

Full logs: https://intel-gfx-ci.01.org/CI/Patchwork_5204/fi-ivb-3520m/igt@gem_sync@basic-store-all.html
Comment 20 Martin Peres 2017-08-25 11:19:06 UTC
*** Bug 102405 has been marked as a duplicate of this bug. ***
Comment 22 Chris Wilson 2017-10-15 21:02:31 UTC
Am I going mad or we haven't seen this for a while right? (It used to be 30-40%) Odd.
Comment 23 Martin Peres 2017-10-15 22:07:07 UTC
(In reply to Chris Wilson from comment #22)
> Am I going mad or we haven't seen this for a while right? (It used to be
> 30-40%) Odd.

Hmm, indeed!

Marta, have you seen this error recently?
Comment 24 Marta Löfstedt 2017-10-17 11:04:56 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3247/shard-snb2/igt@gem_sync@basic-each.html

(gem_sync:2150) CRITICAL: Test assertion failure function sync_ring, file gem_sync.c:159:
(gem_sync:2150) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_sync:2150) CRITICAL: error: 1 != 0
Subtest basic-each failed.

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3247/shard-hsw5/igt@gem_sync@basic-each.html

(gem_sync:2736) CRITICAL: Test assertion failure function sync_ring, file gem_sync.c:159:
(gem_sync:2736) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_sync:2736) CRITICAL: error: 1 != 0
Subtest basic-each failed.
Comment 25 Marta Löfstedt 2017-11-07 12:05:50 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3318/fi-ivb-3520m/igt@gem_sync@basic-store-all.html

(gem_sync:3905) CRITICAL: Test assertion failure function store_all, file gem_sync.c:689:
(gem_sync:3905) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_sync:3905) CRITICAL: error: 1 != 0
Subtest basic-store-all failed.
Comment 26 Marta Löfstedt 2017-12-15 09:19:13 UTC
Last seen: CI_DRM_3326: 2017-11-09 / 266 runs ago, time to close and archive
Comment 27 Martin Peres 2018-04-17 08:15:41 UTC
Here we go again!

https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8694/shard-hsw4/igt@gem_exec_store@pages-vebox.html

(gem_exec_store:4607) CRITICAL: Test assertion failure function store_cachelines, file ../tests/gem_exec_store.c:176:
(gem_exec_store:4607) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_exec_store:4607) CRITICAL: error: 64 != 0
Subtest pages-vebox failed.
Comment 28 Francesco Balestrieri 2018-05-22 10:28:39 UTC
Is this really the same issue as what originally reported? At least the log in the last comment looks slightly different from the previous ones.
Comment 29 Chris Wilson 2018-05-22 10:35:08 UTC
It is the same; it's a hardware issue that we try to paper over. We will never fix it, just make it acceptably unlikely to happen (and our igt throw worst cases at the driver and complain if we succeed in detecting a miss, whereas typically we expect the continual ebb and flow of work to act as a flush and prevent it becoming noticeable.).

I am more surprised that we haven't seen it for ~6 months.
Comment 31 Martin Peres 2018-06-08 08:24:41 UTC
Also seen on BSW: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_57/fi-bsw-n3050/igt@gem_exec_await@wide-contexts.html

(gem_exec_await:2473) CRITICAL: Test assertion failure function wide, file ../tests/gem_exec_await.c:204:
(gem_exec_await:2473) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_exec_await:2473) CRITICAL: error: 2 != 0
Subtest wide-contexts failed.
Comment 32 Chris Wilson 2018-06-08 19:56:50 UTC
(In reply to Martin Peres from comment #31)
> Also seen on BSW:
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_57/fi-bsw-n3050/
> igt@gem_exec_await@wide-contexts.html
> 
> (gem_exec_await:2473) CRITICAL: Test assertion failure function wide, file
> ../tests/gem_exec_await.c:204:
> (gem_exec_await:2473) CRITICAL: Failed assertion:
> intel_detect_and_clear_missed_interrupts(fd) == 0
> (gem_exec_await:2473) CRITICAL: error: 2 != 0
> Subtest wide-contexts failed.

Not quite. This particular error doesn't affect bsw (there are very strong assertions that we can not miss seqno). So this particular example needs a different explanation, and probably just an error in the timer callback.
Comment 33 Francesco Balestrieri 2018-07-02 09:54:57 UTC
Failure rate is now 0.3%. I wonder if this qualifies as "acceptably unlikely to happen".

I am at least lowering the priority to Medium.
Comment 34 Martin Peres 2018-09-03 15:00:36 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_100/fi-gdg-551/igt@gem_exec_await@wide-all.html

(gem_exec_await:1332) CRITICAL: Test assertion failure function wide, file ../tests/gem_exec_await.c:207:
(gem_exec_await:1332) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_exec_await:1332) CRITICAL: error: 1 != 0
Comment 35 Chris Wilson 2018-09-03 15:08:27 UTC
gdg is not G45, its i915g. However, this bug should be specifically kept to the expected error with gen6/gen7 (and at a push gen5, and/or gen8 with ringbuffer *not* execlists).

gdg is unexpected (we have never observed it before on gen3 and believe the seqno write and MI_USER_INTERRUPT is ordered) and the miss looks genuine. So that probably relates to the use of the physical status page where we currently do not mark as snooped (unlike all other platforms).
Comment 36 Martin Peres 2018-10-18 07:58:21 UTC
Still happening: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_130/fi-bdw-5557u/igt@gem_exec_await@wide-contexts.html

Starting subtest: wide-contexts
(gem_exec_await:1048) CRITICAL: Test assertion failure function wide, file ../tests/gem_exec_await.c:207:
(gem_exec_await:1048) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0
(gem_exec_await:1048) CRITICAL: error: 79 != 0
Subtest wide-contexts failed.
Comment 37 Chris Wilson 2018-10-18 08:29:58 UTC
(In reply to Martin Peres from comment #36)
> Still happening:
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_130/fi-bdw-5557u/
> igt@gem_exec_await@wide-contexts.html
> 
> Starting subtest: wide-contexts
> (gem_exec_await:1048) CRITICAL: Test assertion failure function wide, file
> ../tests/gem_exec_await.c:207:
> (gem_exec_await:1048) CRITICAL: Failed assertion:
> intel_detect_and_clear_missed_interrupts(fd) == 0
> (gem_exec_await:1048) CRITICAL: error: 79 != 0
> Subtest wide-contexts failed.

And c34 are both due to a stall while processing the gem_exec_await workload, the same as bug 106680. Stalled long enough to trigger the missed interrupt warning, but not long enough to trigger the GPU hang check, I believe.
Comment 38 Francesco Balestrieri 2018-11-23 10:42:25 UTC
Not seen for more than a month, should we close it?
Comment 39 Lakshmi 2018-11-26 08:32:49 UTC
Last seen IGT_4691_full (1 month / 654 runs ago).
But the occurrence of this bug is in-between 1 week-1 month. So, we can keep this open at least one more month.
Comment 40 Chris Wilson 2018-12-05 10:06:46 UTC
Sadly this bug has been hijacked from the gen6/gen7 known issue of lack of ordered interrupts.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.