Bug 106887 - [CI] igt@gem_ppgtt@blt-vs-render-ctxn - incomplete
Summary: [CI] igt@gem_ppgtt@blt-vs-render-ctxn - incomplete
Status: RESOLVED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: medium normal
Assignee: Andi
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Duplicates: 106023
Depends on:
Blocks:
 
Reported: 2018-06-11 13:42 UTC by Martin Peres
Modified: 2019-03-08 15:24 UTC
CC List: 2 users

See Also:
i915 platform: BSW/CHT, CFL, ICL, KBL, SKL, SNB
i915 features: GEM/PPGTT


Comment 6 Lakshmi 2018-10-26 15:36:23 UTC
*** Bug 106023 has been marked as a duplicate of this bug. ***
Comment 7 Andi 2018-10-29 15:56:35 UTC
These are the steps to manually reproduce the bug (from the igt directory):


./build/tests/gem_exec_capture --run-subtest userptr
./build/tests/gem_ppgtt --run-subtest blt-vs-render-ctxn

It always fails. The failure happens in the gen9_render_copyfunc() function (because I'm testing on Kaby Lake).

The failure shows up at a random point within the 8 × 4096 = 32768 (0x8000) total iterations.
Comment 8 Andi 2018-10-29 16:09:35 UTC
Just to clarify what I wrote above:

The function gen9_render_copyfunc() is called inside a for loop that runs 32768/8 = 4096 times, and this loop is executed by 8 processes simultaneously. In my tests I always got a failure.

To simplify the debugging I ran it in a single process by applying the following patch:

diff --git a/tests/i915/gem_ppgtt.c b/tests/i915/gem_ppgtt.c
index af5e3e07..66b71a68 100644
--- a/tests/i915/gem_ppgtt.c
+++ b/tests/i915/gem_ppgtt.c
@@ -294,7 +294,7 @@ static void flink_and_exit(void)
        close(fd);
 }
 
-#define N_CHILD 8
+#define N_CHILD 1
 int main(int argc, char **argv)
 {
        igt_subtest_init(argc, argv);

This makes the loop iterate exactly 32768 times in a single process.

NOTE: in a single process gen9_render_copyfunc() doesn't always fail, but it still fails at a high rate.
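For reference, a minimal sketch of the structure described above, assuming IGT's standard igt_fork()/igt_waitchildren() multi-process helpers; the loop body, buffer names and WIDTH/HEIGHT are illustrative placeholders, not the actual gem_ppgtt.c source:

/* Each of the N_CHILD forked processes runs TOTAL/N_CHILD iterations,
 * so the total stays at 0x8000 whatever N_CHILD is; with N_CHILD set
 * to 1 a single process runs all 32768 copies. */
#define TOTAL   0x8000  /* 32768 blits in total across all children */
#define N_CHILD 8       /* set to 1 for single-process debugging */

igt_fork(child, N_CHILD) {
        for (int i = 0; i < TOTAL / N_CHILD; i++)
                /* on gen9 hardware (e.g. Kaby Lake) the render copy
                 * resolves to gen9_render_copyfunc() */
                render_copy(batch, ctx,
                            &src, 0, 0, WIDTH, HEIGHT,
                            &dst, 0, 0);
}
igt_waitchildren();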
Comment 9 Chris Wilson 2018-10-29 16:15:37 UTC
(In reply to Andi from comment #8)
> Just to clarify what I wrote above:
> 
> The function gen9_render_copyfunc() is called under a for loop that runs
> (32768/8 = ) 4096. This loop is executed by 8 processes simultaneously. In
> my tests I always got a failure.

You still haven't said what failure.
Comment 10 Andi 2018-10-31 08:50:47 UTC
(In reply to Chris Wilson from comment #9)
> (In reply to Andi from comment #8)
> > Just to clarify what I wrote above:
> > 
> > The function gen9_render_copyfunc() is called under a for loop that runs
> > (32768/8 = ) 4096. This loop is executed by 8 processes simultaneously. In
> > my tests I always got a failure.
> 
> You still haven't said what failure.

The system hangs completely and I have to cut the power to reboot it.
Comment 11 Chris Wilson 2018-11-08 12:25:05 UTC
Could it be

commit 987abd5c62f92ee4970b45aa077f47949974e615 (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Nov 8 08:17:38 2018 +0000

    drm/i915/execlists: Force write serialisation into context image vs execution
    
    Ensure that the writes into the context image are completed prior to the
    register mmio to trigger execution. Although previously we were assured
    by the SDM that all writes are flushed before an uncached memory
    transaction (our mmio write to submit the context to HW for execution),
    we have empirical evidence to believe that this is not actually the
    case.
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108656
    References: https://bugs.freedesktop.org/show_bug.cgi?id=108315
    References: https://bugs.freedesktop.org/show_bug.cgi?id=106887
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20181108081740.25615-1-chris@chris-wilson.co.uk
    Cc: stable@vger.kernel.org

?
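For context, the serialisation pattern that commit message describes looks roughly like the sketch below. This is illustrative only, not the literal patch; submit_context() and its parameters are hypothetical names, while wmb(), writel() and upper/lower_32_bits() are the usual kernel primitives:

/* All cacheable writes into the context image must be globally
 * visible before the uncached mmio write that hands the context to
 * the hardware, hence the write barrier between them. */
static void submit_context(u32 *reg_state, u32 tail,
                           void __iomem *submit_reg, u64 desc)
{
        reg_state[CTX_RING_TAIL + 1] = tail;     /* update context image */

        wmb();  /* serialise image writes vs. the doorbell below */

        writel(upper_32_bits(desc), submit_reg);
        writel(lower_32_bits(desc), submit_reg); /* triggers execution */
}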
Comment 12 Chris Wilson 2018-11-09 12:08:21 UTC
(In reply to Chris Wilson from comment #11)
> Could it be
> 
> commit 987abd5c62f92ee4970b45aa077f47949974e615 (HEAD ->
> drm-intel-next-queued, drm-intel/for-linux-next,
> drm-intel/drm-intel-next-queued)
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Thu Nov 8 08:17:38 2018 +0000
> 
>     drm/i915/execlists: Force write serialisation into context image vs
> execution

Nope, it struck again after that commit landed:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5108/shard-kbl6/igt@gem_ppgtt@blt-vs-render-ctx0.html
Comment 13 Chris Wilson 2018-12-05 09:14:31 UTC
Optimistically, this might be fixed by:

commit 4a15c75c42460252a63d30f03b4766a52945fb47
Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Date:   Mon Dec 3 13:33:41 2018 +0000

    drm/i915: Introduce per-engine workarounds
    
    We stopped re-applying the GT workarounds after engine reset since commit
    59b449d5c82a ("drm/i915: Split out functions for different kinds of
    workarounds").
    
    Issue with this is that some of the GT workarounds live in the MMIO space
    which gets lost during engine resets. So far the registers in 0x2xxx and
    0xbxxx address range have been identified to be affected.
    
    This losing of applied workarounds has obvious negative effects and can
    even lead to hard system hangs (see the linked Bugzilla).
    
    Rather than just restoring this re-application, because we have also
    observed that it is not safe to just re-write all GT workarounds after
    engine resets (GPU might be live and weird hardware states can happen),
    we introduce a new class of per-engine workarounds and move only the
    affected GT workarounds over.
    
    Using the framework introduced in the previous patch, we therefore after
    engine reset, re-apply only the workarounds living in the affected MMIO
    address ranges.
    
    v2:
     * Move Wa_1406609255:icl to engine workarounds as well.
     * Rename API. (Chris Wilson)
     * Drop redundant IS_KABYLAKE. (Chris Wilson)
     * Re-order engine wa/ init so latest platforms are first. (Rodrigo Vivi)
    
    Signed-off-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Bugzilla: https://bugzilla.freedesktop.org/show_bug.cgi?id=107945
    Fixes: 59b449d5c82a ("drm/i915: Split out functions for different kinds of workarounds")
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Jani Nikula <jani.nikula@linux.intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Cc: intel-gfx@lists.freedesktop.org
    Acked-by: Rodrigo Vivi <rodrigo.vivi@intel.com>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: https://patchwork.freedesktop.org/patch/msgid/20181203133341.10258-1-tvrtko.ursulin@linux.intel.com
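In other words, only the workarounds whose registers live in the mmio ranges lost across engine reset get re-applied afterwards, rather than rewriting the whole GT list. A rough sketch of that classification, with hypothetical names throughout (wa_entry, lost_on_engine_reset() and mmio_rmw() are illustrative, not the driver source):

struct wa_entry {
        u32 reg;   /* mmio offset */
        u32 clr;   /* bits to clear */
        u32 set;   /* bits to set */
};

/* hypothetical read-modify-write register helper */
static void mmio_rmw(u32 reg, u32 clr, u32 set);

/* Registers in the 0x2xxx and 0xbxxx ranges were observed to lose
 * their values across engine reset, per the commit message above. */
static bool lost_on_engine_reset(u32 reg)
{
        return (reg & 0xf000) == 0x2000 || (reg & 0xf000) == 0xb000;
}

static void reapply_engine_workarounds(const struct wa_entry *wal, int count)
{
        for (int i = 0; i < count; i++)
                if (lost_on_engine_reset(wal[i].reg))
                        mmio_rmw(wal[i].reg, wal[i].clr, wal[i].set);
}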
Comment 14 Lakshmi 2018-12-18 09:37:59 UTC
Also seen on ICL
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_168/fi-icl-u3/igt@gem_ppgtt@blt-vs-render-ctx0.html

Is this a different issue?
Comment 15 Francesco Balestrieri 2019-01-09 09:02:20 UTC
Latest occurrence ~4 weeks ago:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_168/fi-icl-u3/igt@gem_ppgtt@blt-vs-render-ctx0.html

Let's keep monitoring...
Comment 16 Chris Wilson 2019-01-15 17:42:16 UTC
(In reply to Lakshmi from comment #14)
> Also seen on ICL
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_168/fi-icl-u3/
> igt@gem_ppgtt@blt-vs-render-ctx0.html
> 
> Is this a different issue?

Yes. icl is beset by a number of problems, but there is a definite system hang (hard machine lockup) on kbl that is very likely kbl-specific. Any complete system lockup is likely machine-specific and is best kept separate until root-caused.
Comment 17 CI Bug Log 2019-03-08 15:24:15 UTC
The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.

