Bug 108591

Summary: [CI][DRMTIP] igt@gem_tiled_fence_blits@normal - fail - Failed assertion: linear[i] == start_val
Product: DRI
Reporter: Martin Peres <martin.peres>
Component: IGT
Assignee: Default DRI bug account <dri-devel>
Status: CLOSED FIXED
QA Contact:
Severity: normal
Priority: medium
CC: intel-gfx-bugs
Version: XOrg git
Keywords: regression
Hardware: Other
OS: All
Whiteboard: ReadyForDev
i915 platform: G45, I945GM, I965GM, ILK
i915 features: GEM/Other

Description Martin Peres 2018-10-29 14:38:42 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_135/fi-blb-e6850/igt@gem_tiled_fence_blits@normal.html

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_135/fi-elk-e7500/igt@gem_tiled_fence_blits@normal.html

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_135/fi-gdg-551/igt@gem_tiled_fence_blits@normal.html

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_135/fi-ilk-650/igt@gem_tiled_fence_blits@normal.html

Starting subtest: normal
(gem_tiled_fence_blits:1151) CRITICAL: Test assertion failure function check_bo, file ../tests/i915/gem_tiled_fence_blits.c:80:
(gem_tiled_fence_blits:1151) CRITICAL: Failed assertion: linear[i] == start_val
(gem_tiled_fence_blits:1151) CRITICAL: Expected 0x1a880000, found 0x2a040000 at offset 0x00000000
Subtest normal failed.


This is likely due to one of these changes:
6fc4e48f9ed4 drm/i915: Compare user's 64b GTT offset even on 32b
9125963a9494 drm/i915: Mark up GTT sizes as u64
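
For readers unfamiliar with the test, the failing assertion can be pictured with a minimal sketch. It assumes each buffer is filled with an incrementing dword pattern beginning at start_val, which is what the "Expected ..., found ... at offset ..." message suggests. The helper name first_mismatch is hypothetical and is not taken from the IGT source.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not the IGT source: return the index of the first
 * dword that breaks the incrementing pattern, or -1 if the buffer is intact.
 */
static long first_mismatch(const uint32_t *linear, size_t count,
                           uint32_t start_val)
{
    for (size_t i = 0; i < count; i++) {
        /* Dword i is expected to hold start_val + i (wrapping as uint32_t). */
        if (linear[i] != (uint32_t)(start_val + i))
            return (long)i;
    }
    return -1;
}
```

A mismatch at index 0, as in the log above, means the very first dword read back already differs from the expected start value.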
Comment 1 Chris Wilson 2018-10-29 14:43:20 UTC
Neither of those is as likely as

commit ff2db94acb53543acd7ba4e2badff59807069365 (upstream/master)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jul 23 11:39:09 2018 +0100

    igt/gem_tiled_fence_blits: Remove libdrm_intel dependence
    
    Modernise the test to use igt's ioctl library as opposed to the
    antiquated libdrm_intel.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Comment 2 Chris Wilson 2018-10-29 20:51:31 UTC
gdg/blb can be easily explained: https://patchwork.freedesktop.org/patch/259064/

elk/ilk?
Comment 3 Chris Wilson 2018-10-30 17:32:05 UTC
commit 3aedf1b000e27abfa1bf179205a81efe2b76a508 (upstream/master)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Oct 29 20:47:35 2018 +0000

    igt/gem_tiled_fence_blits: Remember to mark up fence blits
    
    Older platforms require fence registers to perform blits, and so
    userspace is expected to mark up the objects to request fences be
    assigned.
    
    Fixes: ff2db94acb53 ("igt/gem_tiled_fence_blits: Remove libdrm_intel dependence")
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108591
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>


I haven't yet tested elk/ilk, but on the off chance that this helps...
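
As a rough illustration of what "marking up" an object means, the sketch below sets a fence-request flag on each object of a blit. The struct and function names are hypothetical stand-ins, the flag value mirrors EXEC_OBJECT_NEEDS_FENCE from i915_drm.h, and the gen < 4 cutoff is an assumption of this sketch, not something stated in the commit.

```c
#include <stdint.h>

/* Local stand-in mirroring EXEC_OBJECT_NEEDS_FENCE from i915_drm.h. */
#define SKETCH_EXEC_OBJECT_NEEDS_FENCE (1ull << 0)

/* Hypothetical, trimmed-down stand-in for an execbuf object entry. */
struct sketch_exec_object {
    uint32_t handle;
    uint64_t flags;
};

/*
 * Request a fence register for every object in a blit when the platform
 * needs one. The gen < 4 cutoff is an assumption made for this sketch.
 */
static void mark_fence_blit(struct sketch_exec_object *objs, int count,
                            int gen)
{
    if (gen >= 4)
        return; /* assumed: newer platforms do not need fences for blits */

    for (int i = 0; i < count; i++)
        objs[i].flags |= SKETCH_EXEC_OBJECT_NEEDS_FENCE;
}
```

Without the flag, an older platform may blit a tiled object as if it were linear, which would plausibly produce the kind of mismatch reported in the description.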
Comment 4 Martin Peres 2018-11-06 09:11:33 UTC
(In reply to Chris Wilson from comment #3)
> commit 3aedf1b000e27abfa1bf179205a81efe2b76a508 (upstream/master)
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Mon Oct 29 20:47:35 2018 +0000
> 
>     igt/gem_tiled_fence_blits: Remember to mark up fence blits
>     
>     Older platforms require fence registers to perform blits, and so
>     userspace is expected to mark up the objects to request fences be
>     assigned.
>     
>     Fixes: ff2db94acb53 ("igt/gem_tiled_fence_blits: Remove libdrm_intel dependence")
>     Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=108591
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
> 
> 
> I haven't yet tested elk/ilk, but on the off chance that this helps...

Still happening with ILK and ELK: 

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_138/fi-ilk-650/igt@gem_tiled_fence_blits@normal.html

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_138/fi-elk-e7500/igt@gem_tiled_fence_blits@normal.html

Starting subtest: normal
(gem_tiled_fence_blits:1194) CRITICAL: Test assertion failure function check_bo, file ../tests/i915/gem_tiled_fence_blits.c:80:
(gem_tiled_fence_blits:1194) CRITICAL: Failed assertion: linear[i] == start_val
(gem_tiled_fence_blits:1194) CRITICAL: Expected 0x1f4c0000, found 0x1f380000 at offset 0x00000000
Comment 5 Chris Wilson 2018-11-06 09:25:23 UTC
As expected, elk/ilk is a completely different bug, https://patchwork.freedesktop.org/series/52013/
and ideally shouldn't be grouped up with the igt bug.
Comment 6 Chris Wilson 2018-11-07 15:36:57 UTC
(In reply to Chris Wilson from comment #5)
> As expected, elk/ilk is a completely different bug, https://patchwork.freedesktop.org/series/52013/
> and ideally shouldn't be grouped up with the igt bug.

commit 55f99bf2a9c331838c981694bc872cd1ec4070b2
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Nov 5 09:43:05 2018 +0000

    drm/i915/ringbuffer: Delay after EMIT_INVALIDATE for gen4/gen5
    
    Exercising the gpu reloc path strenuously revealed an issue where the
    updated relocations (from MI_STORE_DWORD_IMM) were not being observed
    upon execution. After some experiments with adding pipecontrols (a lot
    of pipecontrols (32) as gen4/5 do not have a bit to wait on earlier pipe
    controls or even the current on), it was discovered that we merely
    needed to delay the EMIT_INVALIDATE by several flushes. It is important
    to note that it is the EMIT_INVALIDATE as opposed to the EMIT_FLUSH that
    needs the delay as opposed to what one might first expect -- that the
    delay is required for the TLB invalidation to take effect (one presumes
    to purge any CS buffers) as opposed to a delay after flushing to ensure
    the writes have landed before triggering invalidation.
    
    Testcase: igt/gem_tiled_fence_blits
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: stable@vger.kernel.org
    Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20181105094305.5767-1-chris@chris-wilson.co.uk
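
The idea in the commit message, delaying after an invalidating flush by emitting several extra flushes, can be sketched as follows. This is not the kernel patch: the function name, the padding count passed by the caller and the SKETCH_INVALIDATE_BIT placeholder are all hypothetical, and the MI_FLUSH encoding (opcode 0x04 << 23) is an assumption that should be checked against the i915 headers.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MI_FLUSH       0x02000000u /* assumed: opcode 0x04 << 23 */
#define SKETCH_INVALIDATE_BIT (1u << 1)   /* placeholder invalidate modifier */

/*
 * Write one flush into cs. When invalidating, follow it with `pad` plain
 * flushes so that later commands only run after the delay.
 * Returns the number of dwords written, or 0 if the buffer is too small.
 */
static size_t emit_flush(uint32_t *cs, size_t space, int invalidate,
                         size_t pad)
{
    size_t need = 1 + (invalidate ? pad : 0);
    size_t n = 0;

    if (space < need)
        return 0;

    cs[n++] = SKETCH_MI_FLUSH | (invalidate ? SKETCH_INVALIDATE_BIT : 0u);
    if (invalidate) {
        for (size_t i = 0; i < pad; i++)
            cs[n++] = SKETCH_MI_FLUSH; /* padding acts purely as a delay */
    }
    return n;
}
```

Only the invalidate path is padded, matching the commit's observation that the delay is needed after EMIT_INVALIDATE and not after EMIT_FLUSH.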
Comment 7 Martin Peres 2018-11-13 15:52:11 UTC
BLB is still happening: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_139/fi-blb-e6850/igt@gem_tiled_fence_blits@normal.html

Starting subtest: normal
(gem_tiled_fence_blits:1040) CRITICAL: Test assertion failure function check_bo, file ../tests/i915/gem_tiled_fence_blits.c:80:
(gem_tiled_fence_blits:1040) CRITICAL: Failed assertion: linear[i] == start_val
(gem_tiled_fence_blits:1040) CRITICAL: Expected 0x06900000, found 0x0aa80000 at offset 0x00000000
Comment 8 Chris Wilson 2018-11-13 16:09:25 UTC
Probably should have mentioned the gpu hang, as that makes it a completely different bug.

<7> [99.123649] hangcheck rcs0
<7> [99.123667] hangcheck     current seqno 1ec, last 1fb, hangcheck 1ec [5952 ms]
<7> [99.123671] hangcheck     Reset count: 0 (global 0)
<7> [99.123676] hangcheck     Requests:
<7> [99.123688] hangcheck         first  1ed [4:2ba8] @ 7677ms: rcs0
<7> [99.123693] hangcheck         last   1fb [4:2bb6] @ 7672ms: rcs0
<7> [99.123704] hangcheck         active 1ed [4:2ba8] @ 7677ms: rcs0
<7> [99.123709] hangcheck         ring->start:  0x00002000
<7> [99.123712] hangcheck         ring->head:   0x0000ad60
<7> [99.123716] hangcheck         ring->tail:   0x0000afc8
<7> [99.123720] hangcheck         ring->emit:   0x0000afc8
<7> [99.123724] hangcheck         ring->space:  0x00006a78
<7> [99.123729] hangcheck [head ad70, postfix ad88, tail ad98, batch 0x00000000_00326000]:
<7> [99.123745] hangcheck [0000] 02000001 00000000 18800080 00326001 02000000 00000000 10800001 000000c0
<7> [99.123750] hangcheck [0020] 000001ed 01000000
<7> [99.123769] hangcheck     RING_START: 0x00002000
<7> [99.123774] hangcheck     RING_HEAD:  0x0000ad84
<7> [99.123778] hangcheck     RING_TAIL:  0x0000afc8
<7> [99.123782] hangcheck     RING_CTL:   0x0001f001
<7> [99.123786] hangcheck     RING_MODE:  0x00000000
<7> [99.123791] hangcheck     ACTHD:  0x00000000_0060ad84
<7> [99.123795] hangcheck     BBADDR: 0x00000000_00000000
<7> [99.123799] hangcheck     DMA_FADDR: 0x00000000_0000cfc8
<7> [99.123803] hangcheck     IPEIR: 0x00000000
<7> [99.123807] hangcheck     IPEHR: 0x02000000
<7> [99.123833] hangcheck         E 1ed [4:2ba8] @ 7677ms: rcs0
<7> [99.123870] hangcheck         E 1ee [4:2ba9] @ 7676ms: rcs0
<7> [99.123874] hangcheck         E 1ef [4:2baa] @ 7676ms: rcs0
<7> [99.123879] hangcheck         E 1f0 [4:2bab] @ 7675ms: rcs0
<7> [99.123884] hangcheck         E 1f1 [4:2bac] @ 7675ms: rcs0
<7> [99.123888] hangcheck         E 1f2 [4:2bad] @ 7675ms: rcs0
<7> [99.123893] hangcheck         E 1f3 [4:2bae] @ 7675ms: rcs0
<7> [99.123897] hangcheck         ...skipping 7 executing requests...
<7> [99.123902] hangcheck         E 1fb [4:2bb6] @ 7672ms: rcs0
<7> [99.123905] hangcheck         Queue priority: -2147483648
<7> [99.123926] hangcheck     gem_tiled_fence [1040] waiting for 1ed
<7> [99.123955] hangcheck IRQ? 0x1 (breadcrumbs? yes)
<7> [99.123959] hangcheck HWSP:
<7> [99.123965] hangcheck [0000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7> [99.123968] hangcheck *
<7> [99.123974] hangcheck [00c0] 000001ec 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7> [99.123979] hangcheck [00e0] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7> [99.123983] hangcheck *
<7> [99.123987] hangcheck Idle? no

The ring looks intact. So it could be a reloc coherency issue where the GPU tried to read/write into garbage, although with stale relocs it would just use the stale locations of old buffers. Still, that's my leading candidate.

Fwiw, my slow pIIIm i915gm doesn't seem to suffer the same fate.
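
To make the seqno numbers in the dump easier to read, here is a small sketch of wraparound-safe seqno arithmetic. The helper names are hypothetical. The signed-difference comparison is the usual idiom for 32-bit sequence numbers and is assumed here, not quoted from the driver.

```c
#include <stdint.h>

/* True if seqno a is at or after seqno b, tolerating 32-bit wraparound. */
static int seqno_passed(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) >= 0;
}

/* Number of requests submitted but not yet completed. */
static uint32_t requests_outstanding(uint32_t current, uint32_t last)
{
    return last - current;
}
```

With current seqno 0x1ec and last 0x1fb this gives 15 outstanding requests, which agrees with the dump: 7 listed, 7 skipped and the final 1fb. The waiter on 1ed is still blocked because 0x1ec has not yet passed 0x1ed.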
Comment 9 Chris Wilson 2018-11-13 16:10:02 UTC
(In reply to Chris Wilson from comment #8)
> Fwiw, my slow pIIIm i915gm doesn't seem to suffer the same fate.

Except I should check pnv for my closest equiv to blb.
Comment 10 Chris Wilson 2018-11-20 09:53:34 UTC
commit 7fa28e146994da1e8a4124623d7da97b798ea520 (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Nov 19 15:41:53 2018 +0000

    drm/i915: Write GPU relocs harder with gen3
    
    Under moderate amounts of GPU stress, we can observe on Bearlake and
    Pineview (later gen3 models) that we execute the following batch buffer
    before the write into the batch is coherent. Adding extra (tested with
    upto 32x) MI_FLUSH to either the invalidation, flush or both phases does
    not solve the incoherency issue with the relocations, but emitting the
    MI_STORE_DWORD_IMM twice does. So be it.
    
    Fixes: 7dd4f6729f92 ("drm/i915: Async GPU relocation processing")
    Testcase: igt/gem_tiled_fence_blits # blb/pnv
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20181119154153.15327-1-chris@chris-wilson.co.uk
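
The workaround described in the commit message, emitting the same MI_STORE_DWORD_IMM twice, can be sketched roughly as below. This is an illustration and not the kernel code: the function name is hypothetical, and the opcode encodings and the three-dword command layout are assumptions to be checked against the i915 headers.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed encodings for this sketch; verify against the i915 headers. */
#define SKETCH_MI_STORE_DWORD_IMM ((0x20u << 23) | 1u)
#define SKETCH_MI_MEM_VIRTUAL     (1u << 22)

/*
 * Emit `repeat` identical store commands that write `value` to `addr`.
 * The caller must provide room for 3 * repeat dwords in `batch`.
 * Returns the number of dwords written.
 */
static size_t emit_reloc_store(uint32_t *batch, uint32_t addr,
                               uint32_t value, int repeat)
{
    size_t n = 0;

    for (int i = 0; i < repeat; i++) {
        batch[n++] = SKETCH_MI_STORE_DWORD_IMM | SKETCH_MI_MEM_VIRTUAL;
        batch[n++] = addr;
        batch[n++] = value;
    }
    return n;
}
```

Calling it with repeat set to 2 corresponds to the "write it twice" approach for Bearlake and Pineview; the second store is redundant in principle and exists only to paper over the observed incoherency.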
Comment 11 Martin Peres 2019-01-04 15:42:07 UTC
(In reply to Chris Wilson from comment #10)
> commit 7fa28e146994da1e8a4124623d7da97b798ea520 (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Mon Nov 19 15:41:53 2018 +0000
> 
>     drm/i915: Write GPU relocs harder with gen3
>     
>     Under moderate amounts of GPU stress, we can observe on Bearlake and
>     Pineview (later gen3 models) that we execute the following batch buffer
>     before the write into the batch is coherent. Adding extra (tested with
>     upto 32x) MI_FLUSH to either the invalidation, flush or both phases does
>     not solve the incoherency issue with the relocations, but emitting the
>     MI_STORE_DWORD_IMM twice does. So be it.
>     
>     Fixes: 7dd4f6729f92 ("drm/i915: Async GPU relocation processing")
>     Testcase: igt/gem_tiled_fence_blits # blb/pnv
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>     Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>     Link: https://patchwork.freedesktop.org/patch/msgid/20181119154153.15327-1-chris@chris-wilson.co.uk

Oddly enough, this was not sufficient to fix the issue, but it stopped failing after drmtip_176 (https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_176/fi-gdg-551/igt@gem_tiled_fence_blits@normal.html), so closing!
