Bug 109677 - [CI][SHARDS] igt@gem_mmap_gtt@hang - fail - Timed out waiting for children
Summary: [CI][SHARDS] igt@gem_mmap_gtt@hang - fail - Timed out waiting for children
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2019-02-19 13:27 UTC by Lakshmi
Modified: 2019-11-22 20:16 UTC

See Also:
i915 platform: ICL
i915 features: GEM/Other



Description Lakshmi 2019-02-19 13:27:08 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_224/fi-icl-u3/igt@gem_mmap_gtt@hang.html

Starting subtest: hang
Subtest hang failed.
**** DEBUG ****
(gem_mmap_gtt:2672) drmtest-DEBUG: Test requirement passed: is_i915_device(fd) && has_known_intel_chipset(fd)
(gem_mmap_gtt:2672) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_mmap_gtt:2672) ioctl_wrappers-DEBUG: Test requirement passed: dir >= 0
(gem_mmap_gtt:2672) ioctl_wrappers-DEBUG: Test requirement passed: err == 0
(gem_mmap_gtt:2672) i915/gem_context-DEBUG: Test requirement passed: has_ban_period || has_bannable
(gem_mmap_gtt:2672) igt_gt-DEBUG: Test requirement passed: has_gpu_reset(fd)
(gem_mmap_gtt:2672) DEBUG: Test requirement passed: igt_sysfs_set_parameter(fd, "reset", "1")
(gem_mmap_gtt:2672) igt_debugfs-DEBUG: Opening debugfs directory '/sys/kernel/debug/dri/0'
(gem_mmap_gtt:2672) INFO: 1099 resets
(gem_mmap_gtt:2672) igt_core-INFO: Timed out waiting for children
****  END  ****
Subtest hang: FAIL (7.417s)
Comment 1 CI Bug Log 2019-02-19 13:28:38 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ICL: igt@gem_mmap_gtt@hang - fail - Timed out waiting for children
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_224/fi-icl-u3/igt@gem_mmap_gtt@hang.html
Comment 2 Chris Wilson 2019-02-19 13:46:08 UTC
https://patchwork.freedesktop.org/patch/286884/
Comment 3 Chris Wilson 2019-02-19 15:28:23 UTC
Oh you reported the icl bogosity and not the genuine bug. Forget about icl, pnv/blb is broken.
Comment 4 Chris Wilson 2019-02-21 14:53:13 UTC
I am ignoring the icl as that is not interesting (just another clock drift)...

commit 43a8f684b6d1e16c6ecf918332f9b35686bf7edd (HEAD -> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Feb 21 10:29:19 2019 +0000

    drm/i915: Reorder struct_mutex-vs-reset_lock in i915_gem_fault()
    
    Annoyingly, struct_mutex was not entirely eliminated from the reset
    pathway; for reasons of its own, intel_display_resume() requires
    struct_mutex to prepare the planes it already captured. To avoid the
    immediate problem of a deadlock between the struct_mutex and the reset
    srcu, we have to acquire the reset_lock before struct_mutex in
    i915_gem_fault(). Now any wait underneath struct_mutex will result us in
    having to forcibly reset all inflight rendering, less than ideal, but
    better than a deadlock (and will do for the short term).
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190221102924.13442-1-chris@chris-wilson.co.uk

Was the bug that should have been reported!
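
For readers unfamiliar with the lock-ordering fix described in the commit message above, here is a minimal sketch of the idea, not the i915 code. It assumes two plain mutexes standing in for reset_lock and struct_mutex (in the kernel the reset side is an SRCU, so this is a simplification), and the functions fault_path, reset_path and run are hypothetical names for illustration. Both paths take reset_lock first and struct_mutex second, so neither can end up holding one lock while waiting forever for the other.

```python
import threading

# Stand-ins for the two locks named in the commit message. Using plain
# mutexes for both is an assumption made for this illustration.
reset_lock = threading.Lock()
struct_mutex = threading.Lock()

completed = []


def fault_path(iterations):
    # Hypothetical stand-in for i915_gem_fault(): reset_lock is taken
    # before struct_mutex, matching the order described in the commit.
    for _ in range(iterations):
        with reset_lock:
            with struct_mutex:
                completed.append("fault")


def reset_path(iterations):
    # Hypothetical stand-in for the reset pathway, which still needs
    # struct_mutex. It uses the same order, so no cycle can form.
    for _ in range(iterations):
        with reset_lock:
            with struct_mutex:
                completed.append("reset")


def run(iterations=1000, timeout=10.0):
    # Run both paths concurrently and report whether they finished in time.
    completed.clear()
    threads = [
        threading.Thread(target=fault_path, args=(iterations,)),
        threading.Thread(target=reset_path, args=(iterations,)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join(timeout)
    return not any(t.is_alive() for t in threads)
```

If one of the two paths took the locks in the opposite order, the two threads could each hold one lock and block on the other, which is the kind of deadlock the commit is avoiding.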
Comment 5 Martin Peres 2019-03-06 15:27:45 UTC
(In reply to Chris Wilson from comment #4)
> I am ignoring the icl as that is not interesting (just another clock
> drift)...
> 
> commit 43a8f684b6d1e16c6ecf918332f9b35686bf7edd (HEAD ->
> drm-intel-next-queued, drm-intel/drm-intel-next-queued)
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Thu Feb 21 10:29:19 2019 +0000
> 
>     drm/i915: Reorder struct_mutex-vs-reset_lock in i915_gem_fault()
>     
>     Annoyingly, struct_mutex was not entirely eliminated from the reset
>     pathway; for reasons of its own, intel_display_resume() requires
>     struct_mutex to prepare the planes it already captured. To avoid the
>     immediate problem of a deadlock between the struct_mutex and the reset
>     srcu, we have to acquire the reset_lock before struct_mutex in
>     i915_gem_fault(). Now any wait underneath struct_mutex will result us in
>     having to forcibly reset all inflight rendering, less than ideal, but
>     better than a deadlock (and will do for the short term).
>     
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Cc: Mika Kuoppala <mika.kuoppala@intel.com>
>     Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>     Link:
> https://patchwork.freedesktop.org/patch/msgid/20190221102924.13442-1-
> chris@chris-wilson.co.uk
> 
> Was the bug that should have been reported!

Thanks for fixing this! However, since this bug was ICL-specific, I'm re-opening it. We know we need to investigate this timer wonkiness...
Comment 6 Martin Peres 2019-04-23 12:53:27 UTC
Also visible in shards! Bumping the priority!
Comment 7 Francesco Balestrieri 2019-04-29 09:37:07 UTC
This commit:

commit 79ffac8599c4d8aa84d313920d3d86d7361c252b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Apr 24 21:07:17 2019 +0100

    drm/i915: Invert the GEM wakeref hierarchy
    
should make the issue disappear from the test results. We still don't know what causes the sudden slowdown on ICL (also seen elsewhere).

Let's continue monitoring this before resolving. In any case, this particular failure should be sporadic and transient enough to have basically no user impact.
Comment 8 Chris Wilson 2019-05-15 21:10:15 UTC
Long time no see. Let's pretend we did manage to remove a delay with the new and improved reset flush.
Comment 9 Lakshmi 2019-09-24 11:45:35 UTC
This issue is happening very regularly:
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_377/fi-icl-u2/igt@gem_mmap_gtt@hang.html
Comment 10 Lakshmi 2019-09-24 11:47:32 UTC
(In reply to Chris Wilson from comment #8)
> Long time no see. Let's pretend we did manage to remove a delay with the new
> and improved reset flush.

(In reply to Lakshmi from comment #9)
> This issue is happening very regularly
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_377/fi-icl-u2/
> igt@gem_mmap_gtt@hang.html

There is a similar failure in PNV as well
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_374/fi-pnv-d510/igt@gem_mmap_gtt@hang.html


Are these two failures the same as each other, and are they different from the original bug?
Comment 11 Chris Wilson 2019-09-24 14:23:20 UTC
Optimistically,

commit 3499c5eb17054e2abd88023fe962768140d24302 (upstream/master, origin/master, origin/HEAD)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Sep 24 13:15:03 2019 +0100

    i915/gem_map_gtt: Escape from slow forked GTT access
    
    Beware the slithy t'oves.
    
    Forked GTT access on icl is notoriously slow, so rather than spend an
    eternity checking the whole object, check for a completion event after
    handling the pagefault. It's is the race of the pagefault vs reset that
    we care most about, and we expect the bug to result in the pagefault
    being blocked indefinitely, so checking afterwards does not reduce
    coverage.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
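
The change described in the commit message above can be sketched roughly as follows. This is not the IGT test code: the real test uses forked children and a GTT mmap, whereas this sketch assumes a bytearray as the object and a threading.Event as the completion signal, and touch_pages is a hypothetical name. The point it illustrates is that the completion check comes after each page access, so at least one access always happens and the loop can exit without walking the whole object.

```python
import threading

PAGE_SIZE = 4096


def touch_pages(buf, done):
    # Hypothetical child loop. Writing the first byte of a page stands in
    # for the GTT pagefault; the completion check follows the access.
    npages = len(buf) // PAGE_SIZE
    page = 0
    touched = 0
    while True:
        buf[page * PAGE_SIZE] = 1      # stand-in for faulting the page in
        touched += 1
        if done.is_set():              # check only after the access returned
            return touched
        page = (page + 1) % npages
```

With the event already set, the loop still performs one access before returning, so the pagefault-vs-reset race is exercised at least once per child.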
Comment 12 CI Bug Log 2019-09-25 06:55:29 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@gem_mmap_gtt@hang - fail - Timed out waiting for children -}
{+ ICL: igt@gem_mmap_gtt@hang - fail - Timed out waiting for children +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_374/fi-pnv-d510/igt@gem_mmap_gtt@hang.html
Comment 13 swathi.dhanavanthri 2019-11-22 20:16:09 UTC
Last seen in drmtip_377 (1 month, 4 weeks ago) and not seen in the last 30 runs, so closing and archiving this.
Comment 14 CI Bug Log 2019-11-22 20:16:18 UTC
The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.

