Summary: | [CI][SHARDS] igt@gem_exec_schedule@semaphore-codependency - fail - Failed assertion: !"GPU hung" | |
---|---|---|---
Product: | DRI | Reporter: | Martin Peres <martin.peres>
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: | RESOLVED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: | normal | |
Priority: | high | CC: | intel-gfx-bugs
Version: | XOrg git | |
Hardware: | Other | |
OS: | All | |
Whiteboard: | ReadyForDev | |
i915 platform: | BXT, GLK, ICL, KBL | i915 features: | GEM/Other
Description
Martin Peres
2019-04-10 13:07:50 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* APL KBL GLK ICL: igt@gem_exec_schedule@semaphore-codependency - fail - Failed assertion: !"GPU hung"
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2832/shard-apl6/igt@gem_exec_schedule@semaphore-codependency.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2832/shard-iclb6/igt@gem_exec_schedule@semaphore-codependency.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2832/shard-kbl1/igt@gem_exec_schedule@semaphore-codependency.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2833/shard-apl4/igt@gem_exec_schedule@semaphore-codependency.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2833/shard-glk8/igt@gem_exec_schedule@semaphore-codependency.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2833/shard-kbl2/igt@gem_exec_schedule@semaphore-codependency.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4937/shard-apl2/igt@gem_exec_schedule@semaphore-codependency.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4937/shard-glk6/igt@gem_exec_schedule@semaphore-codependency.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4937/shard-iclb6/igt@gem_exec_schedule@semaphore-codependency.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4937/shard-kbl4/igt@gem_exec_schedule@semaphore-codependency.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2834/shard-apl8/igt@gem_exec_schedule@semaphore-codependency.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2834/shard-glk5/igt@gem_exec_schedule@semaphore-codependency.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_2834/shard-kbl2/igt@gem_exec_schedule@semaphore-codependency.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4936/shard-glk8/igt@gem_exec_schedule@semaphore-codependency.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4936/shard-iclb5/igt@gem_exec_schedule@semaphore-codependency.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4936/shard-kbl6/igt@gem_exec_schedule@semaphore-codependency.html

Hindsight is perfect. If only I had thought of this test before writing the code.

Fix all ready to go, https://patchwork.freedesktop.org/series/59232/ just waiting for a report from the media guys if it fixes their perf regression. Maybe they'll even get around to reporting a bug.

A CI Bug Log filter associated to this bug has been updated:

{- APL KBL GLK ICL: igt@gem_exec_schedule@semaphore-codependency - fail - Failed assertion: !"GPU hung" -}
{+ SKL APL KBL GLK ICL: igt@gem_exec_schedule@semaphore-codependency - fail - Failed assertion: !"GPU hung" +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5898/shard-skl8/igt@gem_exec_schedule@semaphore-codependency.html

commit b7404c7ecb38b66f103cec694e23a8e99252829e (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Apr 9 16:29:22 2019 +0100

drm/i915: Bump ready tasks ahead of busywaits

Consider two tasks that are running in parallel on a pair of engines (vcs0, vcs1), but then must complete on a shared engine (rcs0). To maximise throughput, we want to run the first ready task on rcs0 (i.e. the first task that completes on either of vcs0 or vcs1). When using semaphores, however, we will instead queue onto rcs in submission order.

To resolve this incorrect ordering, we want to re-evaluate the priority queue when each of the request is ready.
Normally this happens because we only insert into the priority queue requests that are ready, but with semaphores we are inserting ahead of their readiness and to compensate we penalize those tasks with reduced priority (so that tasks that do not need to busywait should naturally be run first). However, given a series of tasks that each use semaphores, the queue degrades into submission fifo rather than readiness fifo, and so to counter this we give a small boost to semaphore users as their dependent tasks are completed (and so we no longer require any busywait prior to running the user task as they are then ready themselves).

v2: Fixup irqsave for schedule_lock (Tvrtko)

Testcase: igt/gem_exec_schedule/semaphore-codependency
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
Cc: Dmitry Ermilov <dmitry.ermilov@intel.com>
Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20190409152922.23894-1-chris@chris-wilson.co.uk

(In reply to Chris Wilson from comment #4)
> commit b7404c7ecb38b66f103cec694e23a8e99252829e (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Tue Apr 9 16:29:22 2019 +0100
>
> drm/i915: Bump ready tasks ahead of busywaits
>
> Consider two tasks that are running in parallel on a pair of engines
> (vcs0, vcs1), but then must complete on a shared engine (rcs0). To
> maximise throughput, we want to run the first ready task on rcs0 (i.e.
> the first task that completes on either of vcs0 or vcs1). When using
> semaphores, however, we will instead queue onto rcs in submission order.
>
> To resolve this incorrect ordering, we want to re-evaluate the priority
> queue when each of the request is ready. Normally this happens because
> we only insert into the priority queue requests that are ready, but with
> semaphores we are inserting ahead of their readiness and to compensate
> we penalize those tasks with reduced priority (so that tasks that do not
> need to busywait should naturally be run first). However, given a series
> of tasks that each use semaphores, the queue degrades into submission
> fifo rather than readiness fifo, and so to counter this we give a small
> boost to semaphore users as their dependent tasks are completed (and so
> we no longer require any busywait prior to running the user task as they
> are then ready themselves).
>
> v2: Fixup irqsave for schedule_lock (Tvrtko)
>
> Testcase: igt/gem_exec_schedule/semaphore-codependency
> Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Cc: Dmitry Rogozhkin <dmitry.v.rogozhkin@intel.com>
> Cc: Dmitry Ermilov <dmitry.ermilov@intel.com>
> Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
> Link: https://patchwork.freedesktop.org/patch/msgid/20190409152922.23894-1-chris@chris-wilson.co.uk

Thanks, this definitely fixed the issue! It used to fail multiple times per run (~3) and has not been seen in 36 runs since.

The CI Bug Log issue associated to this bug has been archived. New failures matching the above filters will not be associated to this bug anymore.
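For readers skimming the thread, the commit quoted above describes the scheduler change only in prose. The following is a minimal, self-contained sketch of the core idea: a request that will busywait on a semaphore is queued before it is ready at a reduced priority, and receives a small boost once its dependencies complete, so the run queue reflects readiness order rather than submission order. This is an illustration only; the struct, helper function, and priority values are invented for the example and are not the i915 driver's actual code.

```c
/*
 * Toy model (NOT the actual i915 code) of the behaviour described in the
 * commit message above: a semaphore-using request is queued before it is
 * ready, at reduced priority; when its dependencies complete it gets a
 * small boost so it runs ahead of requests that would still busywait.
 * All names and numeric values are invented for illustration.
 */
#include <stdbool.h>
#include <stdio.h>

#define PRIO_NORMAL             0
#define PRIO_SEMAPHORE_PENALTY  (-100) /* hypothetical: queued before ready */
#define PRIO_READY_BOOST        50     /* hypothetical: applied on readiness */

struct request {
	const char *name;    /* which dependency chain this models */
	int prio;            /* effective priority in the run queue */
	bool uses_semaphore; /* would busywait if run before it is ready */
	bool ready;          /* all dependencies have completed */
};

/* Called when the last dependency of @rq signals completion. */
static void request_became_ready(struct request *rq)
{
	rq->ready = true;
	if (rq->uses_semaphore && rq->prio <= PRIO_SEMAPHORE_PENALTY) {
		/* No busywait is needed any more: bump it ahead of waiters. */
		rq->prio = PRIO_NORMAL + PRIO_READY_BOOST;
	}
}

int main(void)
{
	/* Two codependent chains competing for the shared engine (rcs0). */
	struct request a = { "vcs0->rcs0", PRIO_SEMAPHORE_PENALTY, true, false };
	struct request b = { "vcs1->rcs0", PRIO_SEMAPHORE_PENALTY, true, false };

	/* Suppose the vcs1 half finishes first: b becomes ready and is boosted. */
	request_became_ready(&b);

	/* A real scheduler would now dequeue the highest-priority request. */
	struct request *next = (b.prio > a.prio) ? &b : &a;
	printf("run %s first (prio %d vs %d)\n", next->name, b.prio, a.prio);
	return 0;
}
```

In the driver itself the re-evaluation is tied to the request's signalers completing and happens under the scheduler lock (the v2 note about irqsave for schedule_lock hints at this); the sketch above only captures the priority-bump decision.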