Bug 109673

Summary: [CI][DRMTIP] Random tests - timeout - Received signal SIGQUIT
Product: DRI Reporter: Lakshmi <lakshminarayana.vudum>
Component: DRM/Intel Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: RESOLVED FIXED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal    
Priority: medium CC: intel-gfx-bugs, jani.saarinen
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard: ReadyForDev
i915 platform: ICL i915 features: GEM/Other

Description Lakshmi 2019-02-19 09:16:46 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_221/fi-icl-u2/igt@gem_tiled_pread_pwrite.html

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_221/fi-icl-u2/igt@gem_pwrite_pread@uncached-pwrite-blt-gtt_mmap-performance.html

Received signal SIGQUIT.
Stack trace: 
 #0 [fatal_sig_handler+0xd5]
 #1 [killpg+0x40]
 #2 [memcpy_from_wc_sse41+0x184]
 #3 [copy_wc_page+0x28]
 #4 [__real_main108+0x1c8]
 #5 [main+0x44]
 #6 [__libc_start_main+0xe7]
 #7 [_start+0x2a]
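
For triage, the frames in a trace like the one above can be pulled out mechanically. The following is a minimal sketch, assuming the `#N [symbol+0xoffset]` layout shown in this report. The helper names `parse_frames` and `interrupted_symbol`, and the idea of skipping the signal-delivery frames, are illustrative and not part of IGT or the CI tooling.

```python
import re

# Matches lines such as " #2 [memcpy_from_wc_sse41+0x184]".
FRAME_RE = re.compile(r"^\s*#(\d+)\s+\[([^\]+]+)\+0x([0-9a-fA-F]+)\]\s*$")

# Frames that belong to signal delivery rather than to the test itself
# (assumption based only on the trace in this report).
SIGNAL_FRAMES = {"fatal_sig_handler", "killpg"}


def parse_frames(trace):
    """Return a list of (index, symbol, offset) tuples from a stack trace."""
    frames = []
    for line in trace.splitlines():
        m = FRAME_RE.match(line)
        if m:
            frames.append((int(m.group(1)), m.group(2), int(m.group(3), 16)))
    return frames


def interrupted_symbol(trace):
    """Return the innermost frame that is not part of signal handling."""
    for _, symbol, _ in parse_frames(trace):
        if symbol not in SIGNAL_FRAMES:
            return symbol
    return None


TRACE = """\
 #0 [fatal_sig_handler+0xd5]
 #1 [killpg+0x40]
 #2 [memcpy_from_wc_sse41+0x184]
 #3 [copy_wc_page+0x28]
"""

print(interrupted_symbol(TRACE))  # prints memcpy_from_wc_sse41
```

Grouping timeouts by the interrupted symbol is one way to tell a test that is stuck in one place from a test that is merely running slowly everywhere.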
Comment 2 Chris Wilson 2019-02-19 09:26:48 UTC
big-copy and gtt-mmap-performance are expected to be fairly slow, so it's not surprising that they may time out (and so we probably shouldn't conflate the bug reports)

gem_tiled_pread_pwrite should only take about 10s. It's pretty much as if the cpu throttled itself. There's nothing here that would vary between runs.
Comment 3 Francesco Balestrieri 2019-02-23 11:24:03 UTC
Won't fix then?
Comment 4 Chris Wilson 2019-02-23 11:37:19 UTC
These sporadic pauses shouldn't be happening, and I don't know why they are happening. I think they are external to i915, but I just can't be sure...

The slow test cases that only exist to give perf metrics we can (and will) drop from CI (in exchange for dedicated perf metrics???) but there's a wider issue here that seems to be affecting icl at large.
Comment 5 CI Bug Log 2019-02-25 10:44:10 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL:  igt@gem_* - timeout - Received signal SIGQUIT -}
{+ ICL:  igt@gem_* - timeout - Received signal SIGQUIT +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_229/fi-icl-u2/igt@gem_linear_blits@interruptible.html
Comment 6 CI Bug Log 2019-02-28 11:20:36 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL:  igt@gem_* - timeout - Received signal SIGQUIT -}
{+ ICL:  igt@gem_* - timeout - Received signal SIGQUIT +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5668/fi-icl-y/igt@gem_mmap_gtt@basic-small-copy.html
Comment 8 CI Bug Log 2019-03-01 10:15:17 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL:  igt@gem_* - timeout - Received signal SIGQUIT -}
{+ ICL:  igt@ random tests - timeout - Received signal SIGQUIT +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4862/fi-icl-u3/igt@kms_plane@pixel-format-pipe-c-planes-source-clamping.html
Comment 9 CI Bug Log 2019-03-01 10:16:38 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL:  igt@ random tests - timeout - Received signal SIGQUIT -}
{+ ICL:  Random tests - timeout - Received signal SIGQUIT +}

 No new failures caught with the new filter
Comment 11 Lakshmi 2019-03-04 07:41:04 UTC
(In reply to CI Bug Log from comment #10)
> A CI Bug Log filter associated to this bug has been updated:
> 
> {- ICL:  Random tests - timeout - Received signal SIGQUIT -}
> {+ ICL:  Random tests - timeout - Received signal SIGQUIT +}
> 
> New failures caught by the filter:
> 
> *
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_235/fi-icl-u3/
> igt@gem_pwrite@big-gtt-fbr.html
> *
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_235/fi-icl-y/
> igt@sw_sync@sync_expired_merge.html
> *
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_235/fi-icl-y/
> igt@gem_mmap_gtt@forked-big-copy-odd.html
> *
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_234/fi-icl-u3/
> igt@gem_tiled_fence_blits@normal.html
> *
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_234/fi-icl-u3/
> igt@kms_plane@pixel-format-pipe-b-planes-source-clamping.html

These failures are timeouts.
Comment 13 Martin Peres 2019-04-23 12:46:14 UTC
(In reply to Chris Wilson from comment #4)
> These sporadic pauses shouldn't be happening, and I don't know why they are
> happening. I think they are external to i915, but I just can't be sure...
> 
> The slow test cases that only exist to give perf metrics we can (and will)
> drop from CI (in exchange for dedicated perf metrics???) but there's a wider
> issue here that seems to be affecting icl at large.

There was an IRQ storm caused by a BIOS issue. Now we only see the issue on icl-y.

Jani, can you check if we disabled the faulty i2c controller on fi-icl-y, since we cannot update the BIOS?
Comment 14 Jani Saarinen 2019-04-23 14:22:06 UTC
Hi,
No, we have not updated anything in the BIOS on icl-y. Should we?
Comment 15 Jani Saarinen 2019-04-24 19:17:00 UTC
We should update BIOS here too.
Comment 16 Jani Saarinen 2019-04-26 11:45:40 UTC
We did not update the BIOS, but we checked with a Core team member that there was an IRQ storm on this machine as well, and a workaround is now in use (disabling some BIOS settings) to get rid of this IRQ storm. Hopefully ICL-Y now also works more reliably.
Comment 17 Jani Saarinen 2019-04-26 15:33:58 UTC
To follow up: on ICL-Y, I2C4 and I2C5 are now disabled.
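
An IRQ storm of this kind typically shows up as one interrupt line whose count climbs far faster than the rest. Below is a rough sketch of how one might spot it by diffing two snapshots of `/proc/interrupts`. The sample snapshots, the device names in them, and the helpers `parse_interrupts` and `irq_deltas` are made up for illustration; none of the numbers were taken from the CI machines.

```python
def parse_interrupts(text):
    """Map IRQ label -> (total count across CPUs, description)."""
    lines = text.strip().splitlines()
    ncpus = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        label, _, rest = line.partition(":")
        fields = rest.split()
        counts = []
        for field in fields[:ncpus]:
            if not field.isdigit():
                break  # rows such as ERR/MIS have fewer count columns
            counts.append(int(field))
        table[label.strip()] = (sum(counts), " ".join(fields[len(counts):]))
    return table


def irq_deltas(before, after):
    """Return [(delta, label, description)], largest increase first."""
    old = parse_interrupts(before)
    new = parse_interrupts(after)
    deltas = []
    for label, (count, desc) in new.items():
        deltas.append((count - old.get(label, (0, ""))[0], label, desc))
    return sorted(deltas, reverse=True)


# Hypothetical snapshots taken a few seconds apart.
BEFORE = """\
           CPU0       CPU1
  16:       100        200   IO-APIC   16-fasteoi   i2c_designware.4
 130:      5000       4000   PCI-MSI 32768-edge      i915
"""
AFTER = """\
           CPU0       CPU1
  16:    900100     800200   IO-APIC   16-fasteoi   i2c_designware.4
 130:      5100       4050   PCI-MSI 32768-edge      i915
"""

for delta, label, desc in irq_deltas(BEFORE, AFTER):
    print(f"IRQ {label:>4}: +{delta:<8} {desc}")
```

On a real machine the two snapshots would come from reading `/proc/interrupts` twice with a short sleep in between; a line that dominates the deltas while the system is otherwise idle is the one to investigate.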
Comment 18 Francesco Balestrieri 2019-04-29 16:46:17 UTC
And not seen after that. Of course it's been only three days, so too early to celebrate.
Comment 19 Chris Wilson 2019-05-15 21:08:40 UTC
The SIGQUITs are worth writing off as

commit 3970564940ba0322bcefce7fd8fd35c2b85846bf
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue May 7 13:11:08 2019 +0100

    drm/i915: Stop spinning for DROP_IDLE (debugfs/i915_drop_caches)
    
    If the user is racing a call to debugfs/i915_drop_caches with ongoing
    submission from another thread/process, we may never end up idling the
    GPU and be uninterruptibly spinning in debugfs/i915_drop_caches trying
    to catch an idle moment.
    
    Just flush the work once, that should be enough to park the system under
    correct conditions. Outside of those we either have a driver bug or the
    user is racing themselves. Sadly, because the user may be provoking the
    unwanted situation we can't put a warn here to attract attention to a
    probable bug.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190507121108.18377-4-chris@chris-wilson.co.uk

Unless any remain...
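
The behavioural change described in that commit message can be illustrated with a toy model. This is a sketch in Python and not the kernel code: `ToyGpu`, `drop_idle_spinning`, `drop_idle_once` and the `max_spins` cap (added only so the example terminates) are all invented here.

```python
class ToyGpu:
    """Toy stand-in for a GPU whose queue a racing client may keep refilling."""

    def __init__(self, pending, refill_per_flush):
        self.pending = pending
        # Work queued by the (hypothetical) racing submitter after each flush.
        self.refill_per_flush = refill_per_flush

    def flush(self):
        # Retire everything currently queued; the racing client then
        # immediately queues more.
        self.pending = self.refill_per_flush

    def is_idle(self):
        return self.pending == 0


def drop_idle_spinning(gpu, max_spins=1000):
    """Modelled old behaviour: keep flushing until the GPU is seen idle.

    Returns the number of flushes needed, or None if the cap was hit.
    The cap exists only in this model.
    """
    for spins in range(1, max_spins + 1):
        gpu.flush()
        if gpu.is_idle():
            return spins
    return None


def drop_idle_once(gpu):
    """Modelled new behaviour: flush once and report what was observed."""
    gpu.flush()
    return gpu.is_idle()


quiet = ToyGpu(pending=3, refill_per_flush=0)
racing = ToyGpu(pending=3, refill_per_flush=1)
print(drop_idle_spinning(quiet))     # 1
print(drop_idle_spinning(racing))    # None
print(drop_idle_once(ToyGpu(3, 1)))  # False
```

With a racing submitter the spinning variant never observes an idle GPU, which in this model corresponds to a test hanging until the runner's timeout fires. The single-flush variant always returns, at the cost of sometimes returning while the GPU is still busy.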
Comment 20 Francesco Balestrieri 2019-06-03 14:39:44 UTC
CI is still reporting this error, see e.g. 

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6179/re-icl-u/igt@gem_mmap_gtt@forked-medium-copy-odd.html

Should we reopen this or is it another issue?
Comment 21 Chris Wilson 2019-06-03 14:43:20 UTC
(In reply to Francesco Balestrieri from comment #20)
> CI is still reporting this error, see e.g. 
> 
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6179/re-icl-u/
> igt@gem_mmap_gtt@forked-medium-copy-odd.html
> 
> Should we reopen this or is it another issue?

That's a very very particular issue and not random at all. I thought we had it logged already.
Comment 22 Francesco Balestrieri 2019-06-03 14:46:51 UTC
OK. For completeness, there is also

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6179/re-icl-u/igt@gem_mmap_gtt@forked-big-copy-xy.html
Comment 25 Martin Peres 2019-09-17 13:21:34 UTC
(In reply to Martin Peres from comment #24)
> Still happening:
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_367/fi-icl-u4/
> igt@perf_pmu@cpu-hotplug.html

Never mind, this is another bug!
Comment 26 CI Bug Log 2019-09-17 13:21:45 UTC
The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.
