Bug 106099 - [CI gdg] igt@gem_exec_reloc@basic-wc-(gtt|cpu)* - fail - Failed assertion: reloc.presumed_offset == offset
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: medium normal
Assignee: Tvrtko Ursulin
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Duplicates: 106376
Depends on:
Blocks:
 
Reported: 2018-04-17 11:40 UTC by Martin Peres
Modified: 2019-10-23 14:31 UTC
CC List: 1 user

See Also:
i915 platform: I915G
i915 features: GEM/Other



Description Martin Peres 2018-04-17 11:40:12 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_20/fi-gdg-551/igt@gem_exec_reloc@basic-wc-read-noreloc.html

(gem_exec_reloc:1318) CRITICAL: Test assertion failure function basic_reloc, file ../tests/gem_exec_reloc.c:422:
(gem_exec_reloc:1318) CRITICAL: Failed assertion: reloc.presumed_offset == offset
(gem_exec_reloc:1318) CRITICAL: error: 0x322000 != 0xffffffff
Subtest basic-wc-read-noreloc failed.
Comment 1 Chris Wilson 2018-04-17 11:46:34 UTC
It hit the slow path (where we have to tell userspace to do relocations on the next pass) where we did not expect it to. It could be an interrupt, it could be memory pressure, or it could be a bug (in igt or execbuf).
Comment 2 Chris Wilson 2018-05-03 13:20:02 UTC
*** Bug 106376 has been marked as a duplicate of this bug. ***
Comment 3 Martin Peres 2018-05-03 14:20:42 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_29/fi-gdg-551/igt@gem_exec_big.html

(gem_exec_big:1176) CRITICAL: Test assertion failure function execN, file ../tests/gem_exec_big.c:192:
(gem_exec_big:1176) CRITICAL: Failed assertion: tmp == gem_reloc[n].presumed_offset
(gem_exec_big:1176) CRITICAL: error: -559038845 != 3805184
Test gem_exec_big failed.
Comment 4 Martin Peres 2018-05-22 22:28:41 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_36/fi-gdg-551/igt@gem_exec_reloc@basic-wc-gtt.html

Also seen on a non-noreloc test.
Comment 5 Chris Wilson 2018-05-23 08:52:23 UTC
(In reply to Martin Peres from comment #3)
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_29/fi-gdg-551/igt@gem_exec_big.html
> 
> (gem_exec_big:1176) CRITICAL: Test assertion failure function execN, file
> ../tests/gem_exec_big.c:192:
> (gem_exec_big:1176) CRITICAL: Failed assertion: tmp ==
> gem_reloc[n].presumed_offset
> (gem_exec_big:1176) CRITICAL: error: -559038845 != 3805184
> Test gem_exec_big failed.

Careful, that isn't the same class of failure. That's arguably the same read/write incoherency we see elsewhere in gdg. It's where the offset is 0xffffffff that is a test bug.
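
The two failure classes are easier to compare once the signed decimal values from gem_exec_big are printed the same way as the gem_exec_reloc errors. A minimal Python sketch, assuming the logged values are 32-bit quantities (the helper name `as_u32_hex` is hypothetical and not part of igt):

```python
def as_u32_hex(value: int) -> str:
    """Reinterpret an integer as an unsigned 32-bit value and format it as hex."""
    return f"{value & 0xFFFFFFFF:#x}"


# Values taken from the gem_exec_big failure quoted above (comment #3).
tmp = as_u32_hex(-559038845)
presumed_offset = as_u32_hex(3805184)

print(tmp, "!=", presumed_offset)
# Neither side is 0xffffffff, the value that marks the failure tracked by this bug.
print(tmp == "0xffffffff" or presumed_offset == "0xffffffff")
```

Under that assumption the gem_exec_big values come out as 0xdeadbe83 and 0x3a1000, so this failure does not involve the 0xffffffff value.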
Comment 6 Martin Peres 2018-05-23 21:28:15 UTC
(In reply to Chris Wilson from comment #5)
> (In reply to Martin Peres from comment #3)
> > https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_29/fi-gdg-551/igt@gem_exec_big.html
> > 
> > (gem_exec_big:1176) CRITICAL: Test assertion failure function execN, file
> > ../tests/gem_exec_big.c:192:
> > (gem_exec_big:1176) CRITICAL: Failed assertion: tmp ==
> > gem_reloc[n].presumed_offset
> > (gem_exec_big:1176) CRITICAL: error: -559038845 != 3805184
> > Test gem_exec_big failed.
> 
> Careful, that isn't the same class of failure. That's arguably the same
> read/write incoherency we see elsewhere in gdg. It's where the offset is
> 0xffffffff that is a test bug.

ok! I will file another bug then!
Comment 7 Chris Wilson 2018-09-06 19:47:23 UTC
Fingers crossed, but

commit fddcd00a49e9122a3579247151e9cb3ce5a1a36e
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Sep 3 09:33:35 2018 +0100

    drm/i915: Force the slow path after a user-write error
    
    If we fail to write the user relocation back when it is changed, force
    ourselves to take the slow relocation path where we can handle faults in
    the write path. There is still an element of dubiousness as having
    patched up the batch to use the correct offset, it no longer matches the
    presumed_offset in the relocation, so a second pass may miss any changes
    in layout.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180903083337.13134-3-chris@chris-wilson.co.uk

seems a more than likely suspect.
Comment 8 Martin Peres 2018-09-20 17:29:31 UTC
(In reply to Chris Wilson from comment #7)
> Fingers crossed, but
> 
> commit fddcd00a49e9122a3579247151e9cb3ce5a1a36e
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Mon Sep 3 09:33:35 2018 +0100
> 
>     drm/i915: Force the slow path after a user-write error
>     
>     If we fail to write the user relocation back when it is changed, force
>     ourselves to take the slow relocation path where we can handle faults in
>     the write path. There is still an element of dubiousness as having
>     patched up the batch to use the correct offset, it no longer matches the
>     presumed_offset in the relocation, so a second pass may miss any changes
>     in layout.
>     
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
>     Link: https://patchwork.freedesktop.org/patch/msgid/20180903083337.13134-3-chris@chris-wilson.co.uk
> 
> seems a more than likely suspect.

This still happened days later: https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_108/fi-gdg-551/igt@gem_exec_reloc@basic-wc-cpu-noreloc.html
Comment 9 Francesco Balestrieri 2018-12-04 08:42:37 UTC
Last seen three days ago.
Comment 10 Martin Peres 2018-12-20 15:15:19 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_157/fi-gdg-551/igt@gem_exec_big.html

(gem_exec_big:949) CRITICAL: Test assertion failure function exec1, file ../tests/i915/gem_exec_big.c:112:
(gem_exec_big:949) CRITICAL: Failed assertion: tmp == gem_reloc[0].presumed_offset
(gem_exec_big:949) CRITICAL: error: 0 != 3289088
Test gem_exec_big failed.
Comment 11 CI Bug Log 2019-01-31 13:04:11 UTC
A CI Bug Log filter associated to this bug has been updated:

{- fi-gdg-551: igt@gem_exec_big - fail - Failed assertion: tmp == gem_reloc[(n|0)].presumed_offset -}
{+ GDG HSW: igt@gem_exec_big - fail - Failed assertion: tmp == gem_reloc[(n|0)].presumed_offset +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_197/fi-hsw-peppy/igt@gem_exec_big.html
Comment 12 Francesco Balestrieri 2019-03-01 12:42:13 UTC
Still happening, latest occurrence:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_231/fi-gdg-551/igt@gem_exec_big.html
Comment 13 Vanshidhar Konda 2019-06-11 17:05:07 UTC
This issue still occurs on the fi-gdg-551 machine every 1-2 weeks. It has not been observed on the HSW machines in the past 4 months. In that period, the bug has only been observed with the WC and no-reloc flags on the fi-gdg-551 machine.

In the most recent occurrence of the issue, the offset and presumed offset didn't match, and presumed_offset != -1.

(gem_exec_reloc:1007) CRITICAL: Failed assertion: reloc.presumed_offset == offset
(gem_exec_reloc:1007) CRITICAL: error: 0x325000 != 0x327000

The older occurrences of the bug do not have logs available. Would it be possible to retain logs for this filter beyond 2 months?
Comment 14 Chris Wilson 2019-06-11 20:54:22 UTC
The only bug that is relevant here is where we report 0xffffffff, i.e. we unexpectedly hit the relocation slow path.

Please do not conflate the wider gdg incoherency with this bug.
Comment 15 CI Bug Log 2019-08-28 12:44:08 UTC
A CI Bug Log filter associated to this bug has been updated:

{- GDG: igt@gem_exec_reloc@basic-wc* - fail - Failed assertion: reloc.presumed_offset == offset -}
{+ GDG: igt@gem_exec_reloc@basic-wc* - fail - Failed assertion: reloc.presumed_offset == offset, error: 0x[\da-f]+ != 0xffffffff +}


  No new failures caught with the new filter
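
To illustrate what the narrowed filter keeps and drops, here is a minimal Python sketch that applies the error pattern from the `{+ ... +}` line to log lines quoted earlier in this bug. How CI Bug Log itself applies its patterns is not stated here, so the use of `re.search` and the helper name `matches_filter` are assumptions:

```python
import re

# Error pattern copied from the updated filter above.
ERROR_RE = re.compile(r"error: 0x[\da-f]+ != 0xffffffff")

# Log lines quoted in this bug report.
log_lines = [
    "(gem_exec_reloc:1318) CRITICAL: error: 0x322000 != 0xffffffff",  # description
    "(gem_exec_reloc:1007) CRITICAL: error: 0x325000 != 0x327000",    # comment #13
    "(gem_exec_big:1176) CRITICAL: error: -559038845 != 3805184",     # comment #3
]


def matches_filter(line: str) -> bool:
    """Return True if the line contains the 0xffffffff error signature."""
    return ERROR_RE.search(line) is not None


for line in log_lines:
    print(matches_filter(line), line)
```

Only the first line matches. The other two mismatches, which comment #14 attributes to the wider gdg incoherency, fall outside the filter.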
Comment 16 CI Bug Log 2019-08-28 12:44:17 UTC
The CI Bug Log issue associated to this bug has been updated.

### Removed filters

* GDG HSW: igt@gem_exec_big - fail - Failed assertion: tmp == gem_reloc[(n|0)].presumed_offset (added 6 months, 4 weeks ago)
Comment 17 Martin Peres 2019-08-28 12:45:31 UTC
(In reply to Chris Wilson from comment #14)
> The only bug that is relevant here is where we report 0xffffffff, i.e. we
> unexpectedly hit the relocation slow path.
> 
> Please do not conflate the wider gdg incoherency with this bug.

Filter updated! Thanks!
Comment 18 CI Bug Log 2019-09-02 10:57:13 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* GDG: igt@gem_exec_reloc@basic-wc-noreloc - fail - Failed assertion: reloc.presumed_offset == offset
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_355/fi-gdg-551/igt@gem_exec_reloc@basic-wc-noreloc.html
Comment 19 Lakshmi 2019-10-23 06:51:31 UTC
@Chris, there is a new failure captured under this bug:

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_389/fi-gdg-551/igt@gem_exec_reloc@basic-wc-read.html
Starting subtest: basic-wc-read
(gem_exec_reloc:964) CRITICAL: Test assertion failure function basic_reloc, file ../tests/i915/gem_exec_reloc.c:424:
(gem_exec_reloc:964) CRITICAL: Failed assertion: reloc.presumed_offset == offset
(gem_exec_reloc:964) CRITICAL: error: 0x30b000 != 0xffffffff
Subtest basic-wc-read failed.
Comment 20 CI Bug Log 2019-10-23 14:31:34 UTC
A CI Bug Log filter associated to this bug has been updated:

{- GDG: igt@gem_exec_reloc@basic-wc-noreloc - fail - Failed assertion: reloc.presumed_offset == offset -}
{+ GDG: igt@gem_exec_reloc@basic-wc-.* - fail - Failed assertion: reloc.presumed_offset == offset +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_389/fi-gdg-551/igt@gem_exec_reloc@basic-wc-gtt-noreloc.html
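
The effect of widening the test-name pattern can be checked against the subtest names that appear in this bug. A minimal Python sketch, assuming the pattern must match the whole test name (the actual matching semantics of CI Bug Log are not stated here):

```python
import re

# Test-name patterns from the old and new filter lines above.
OLD = re.compile(r"igt@gem_exec_reloc@basic-wc-noreloc")
NEW = re.compile(r"igt@gem_exec_reloc@basic-wc-.*")

# Subtest names taken from the CI links in this bug report.
tests = [
    "igt@gem_exec_reloc@basic-wc-noreloc",
    "igt@gem_exec_reloc@basic-wc-read",
    "igt@gem_exec_reloc@basic-wc-gtt-noreloc",
]

for name in tests:
    # Columns: test name, matched by old pattern, matched by new pattern.
    print(name, bool(OLD.fullmatch(name)), bool(NEW.fullmatch(name)))
```

With whole-name matching, the old pattern covers only basic-wc-noreloc, while the new one also covers basic-wc-read (comment #19) and basic-wc-gtt-noreloc.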

