Summary: | [BAT][SHARDS] igt@gem_exec_(await@wide-all|await@wide-context|store@cachelines-bsd) - fail - Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0 | ||
---|---|---|---|
Product: | DRI | Reporter: | Ander Conselvan de Oliveira <conselvan2> |
Component: | DRM/Intel | Assignee: | Francesco Balestrieri <francesco.balestrieri> |
Status: | CLOSED INVALID | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | critical | ||
Priority: | medium | CC: | intel-gfx-bugs, marta.lofstedt, martin.peres |
Version: | unspecified | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | ReadyForDev | ||
i915 platform: | ALL | i915 features: | GEM/Other |
Description
Ander Conselvan de Oliveira
2017-02-28 13:45:46 UTC
In essence, it is a timing issue between raising the interrupt and processing the breadcrumb. Note sure just why commit 67b807a892302d184f6987b0c3cc2dc2505e4a2d Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Mon Feb 27 20:58:50 2017 +0000 drm/i915: Delay disabling the user interrupt for breadcrumbs ... commit 8d769ea7bc16c34c9dc5143be021e943014c4cd1 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Mon Feb 27 20:58:47 2017 +0000 drm/i915: Report both waiters and success from intel_engine_wakeup() would have had a significant (bad) impact on the irq-seqno timings, or why it is only showing up in a single test - though we haven't had that many runs yet. Usual test is the full gem_sync to see how regular the error is. Been trying to reproduce with a processor : 3 vendor_id : GenuineIntel cpu family : 6 model : 42 model name : Intel(R) Core(TM) i5-2500 CPU @ 3.30GHz stepping : 7 microcode : 0x1a to no avail. So far it has only occurred on one CI snb as well. Also checking with a vendor_id : GenuineIntel cpu family : 6 model : 42 model name : Intel(R) Core(TM) i5-2520M CPU @ 2.50GHz stepping : 7 microcode : 0x29 and no hits. Applied commit b66255f0f77902ef41b09163a6a092d2d905e151 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Fri Mar 3 17:14:22 2017 +0000 drm/i915: Refactor wakeup of the next breadcrumb waiter Refactor the common task of updating the first_waiter, serialised with the interrupt handler. When we update the first_waiter, we also need to wakeup the new bottom-half in order to complete the actions that we may have delegated to it (such as checking the irq-seqno coherency or waking up other lower priority concurrent waiters). Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170303171422.4735-1-chris@chris-wilson.co.uk in an attempt to curtail some missed wakeups. Still very much visible, it happened on CI_DRM_2368-CI_DRM_2370. Failure rate 45/136 run(s) (33%) Raising the priority! gen6_seqno_barrier(struct intel_engine_cs *engine) { struct drm_i915_private *dev_priv = engine->i915; + int loop; /* Workaround to force correct ordering between irq and seqno writes on * ivb (and maybe also on snb) by reading from a CS register (like @@ -937,7 +938,8 @@ gen6_seqno_barrier(struct intel_engine_cs *engine) * take the spinlock to guard against concurrent cacheline access. */ spin_lock_irq(&dev_priv->uncore.lock); - POSTING_READ_FW(RING_ACTHD(engine->mmio_base)); + for (loop = 0; loop < 3; loop++) + POSTING_READ_FW(RING_ACTHD(engine->mmio_base)); spin_unlock_irq(&dev_priv->uncore.lock); was a bust. Adding reference to Chris' patch: https://patchwork.freedesktop.org/series/21876/ Seems like the above patch is not sufficient: "Second run and the miss came back. Boo. -Chris" Thread: http://www.spinics.net/lists/intel-gfx/msg124328.html Yep, keeps happening. Statistics: Failure rate 60/196 run(s) (30%) 00:02.0 0300: 8086:0102 (rev 09) has survived 70h without failure. Adding tag into "Whiteboard" field - ReadyForDev The bug still active *Status is correct *Platform is included *Feature is included *Priority and Severity correctly set Has been seen on pw: https://patchwork.freedesktop.org/series/21868/ but not after that. We will follow. Happened now on ci on CI_DRM_2638 Statistics: Failure rate 18/123 run(s) (14%) Hello, Seems that this bug is going to need more time to fix. Changing status to NEEDINFO. Once provided more information, please change to REOPEN. Thank you. Me and Tomi Sarvela have been trying quite extensively to reproduce this outside of CI. And without success. There is something inherently tied to CI environment that we haven't yet figured out. Also seen on fi-ivb-3520m, on igt@gem_sync@basic-store-all. Full logs: https://intel-gfx-ci.01.org/CI/Patchwork_5204/fi-ivb-3520m/igt@gem_sync@basic-store-all.html *** Bug 102405 has been marked as a duplicate of this bug. *** reproduced on: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3177/shard-hsw3/igt@gem_sync@basic-store-each.html Am I going mad or we haven't seen this for a while right? (It used to be 30-40%) Odd. (In reply to Chris Wilson from comment #22) > Am I going mad or we haven't seen this for a while right? (It used to be > 30-40%) Odd. Hmm, indeed! Marta, have you seen this error recently? https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3247/shard-snb2/igt@gem_sync@basic-each.html (gem_sync:2150) CRITICAL: Test assertion failure function sync_ring, file gem_sync.c:159: (gem_sync:2150) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0 (gem_sync:2150) CRITICAL: error: 1 != 0 Subtest basic-each failed. https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3247/shard-hsw5/igt@gem_sync@basic-each.html (gem_sync:2736) CRITICAL: Test assertion failure function sync_ring, file gem_sync.c:159: (gem_sync:2736) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0 (gem_sync:2736) CRITICAL: error: 1 != 0 Subtest basic-each failed. https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3318/fi-ivb-3520m/igt@gem_sync@basic-store-all.html (gem_sync:3905) CRITICAL: Test assertion failure function store_all, file gem_sync.c:689: (gem_sync:3905) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0 (gem_sync:3905) CRITICAL: error: 1 != 0 Subtest basic-store-all failed. Last seen: CI_DRM_3326: 2017-11-09 / 266 runs ago, time to close and archive Here we go again! https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_8694/shard-hsw4/igt@gem_exec_store@pages-vebox.html (gem_exec_store:4607) CRITICAL: Test assertion failure function store_cachelines, file ../tests/gem_exec_store.c:176: (gem_exec_store:4607) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0 (gem_exec_store:4607) CRITICAL: error: 64 != 0 Subtest pages-vebox failed. Is this really the same issue as what originally reported? At least the log in the last comment looks slightly different from the previous ones. It is the same; it's a hardware issue that we try to paper over. We will never fix it, just make it acceptably unlikely to happen (and our igt throw worst cases at the driver and complain if we succeed in detecting a miss, whereas typically we expect the continual ebb and flow of work to act as a flush and prevent it becoming noticeable.). I am more surprised that we haven't seen it for ~6 months. https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_44/fi-kbl-guc/igt@gem_exec_await@wide-contexts.html https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_51/fi-skl-6770hq/igt@gem_exec_await@wide-contexts.html Also seen on BSW: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_57/fi-bsw-n3050/igt@gem_exec_await@wide-contexts.html (gem_exec_await:2473) CRITICAL: Test assertion failure function wide, file ../tests/gem_exec_await.c:204: (gem_exec_await:2473) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0 (gem_exec_await:2473) CRITICAL: error: 2 != 0 Subtest wide-contexts failed. (In reply to Martin Peres from comment #31) > Also seen on BSW: > https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_57/fi-bsw-n3050/ > igt@gem_exec_await@wide-contexts.html > > (gem_exec_await:2473) CRITICAL: Test assertion failure function wide, file > ../tests/gem_exec_await.c:204: > (gem_exec_await:2473) CRITICAL: Failed assertion: > intel_detect_and_clear_missed_interrupts(fd) == 0 > (gem_exec_await:2473) CRITICAL: error: 2 != 0 > Subtest wide-contexts failed. Not quite. This particular error doesn't affect bsw (there are very strong assertions that we can not miss seqno). So this particular example needs a different explanation, and probably just an error in the timer callback. Failure rate is now 0.3%. I wonder if this qualifies as "acceptably unlikely to happen". I am at least lowering the priority to Medium. https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_100/fi-gdg-551/igt@gem_exec_await@wide-all.html (gem_exec_await:1332) CRITICAL: Test assertion failure function wide, file ../tests/gem_exec_await.c:207: (gem_exec_await:1332) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0 (gem_exec_await:1332) CRITICAL: error: 1 != 0 gdg is not G45, its i915g. However, this bug should be specifically kept to the expected error with gen6/gen7 (and at a push gen5, and/or gen8 with ringbuffer *not* execlists). gdg is unexpected (we have never observed it before on gen3 and believe the seqno write and MI_USER_INTERRUPT is ordered) and the miss looks genuine. So that probably relates to the use of the physical status page where we currently do not mark as snooped (unlike all other platforms). Still happening: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_130/fi-bdw-5557u/igt@gem_exec_await@wide-contexts.html Starting subtest: wide-contexts (gem_exec_await:1048) CRITICAL: Test assertion failure function wide, file ../tests/gem_exec_await.c:207: (gem_exec_await:1048) CRITICAL: Failed assertion: intel_detect_and_clear_missed_interrupts(fd) == 0 (gem_exec_await:1048) CRITICAL: error: 79 != 0 Subtest wide-contexts failed. (In reply to Martin Peres from comment #36) > Still happening: > https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_130/fi-bdw-5557u/ > igt@gem_exec_await@wide-contexts.html > > Starting subtest: wide-contexts > (gem_exec_await:1048) CRITICAL: Test assertion failure function wide, file > ../tests/gem_exec_await.c:207: > (gem_exec_await:1048) CRITICAL: Failed assertion: > intel_detect_and_clear_missed_interrupts(fd) == 0 > (gem_exec_await:1048) CRITICAL: error: 79 != 0 > Subtest wide-contexts failed. And c34 are both due to a stall while processing the gem_exec_await workload, the same as bug 106680. Stalled long enough to trigger the missed interrupt warning, but not long enough to trigger the GPU hang check, I believe. Not seen for more than a month, should we close it? Last seen IGT_4691_full (1 month / 654 runs ago). But the occurrence of this bug is in-between 1 week-1 month. So, we can keep this open at least one more month. Sadly this bug has been hijacked from the gen6/gen7 known issue of lack of ordered interrupts. The bug used to be seen once every 105 runs in average. Now not seen for 1263 runs. The CI Bug Log issue associated to this bug has been archived. New failures matching the above filters will not be associated to this bug anymore. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.