Bug 108569 - [CI][BAT icl] igt@drv_selftest@live_contexts - dmesg-fail - igt_ctx_readonly failed with error -5 HSDES#:1807136187
Status: RESOLVED NOTOURBUG
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: highest normal
Assignee: Mika Kuoppala
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Duplicates: 108342 109210 109226
Depends on:
Blocks:
 
Reported: 2018-10-26 14:47 UTC by Martin Peres
Modified: 2019-04-11 21:23 UTC

See Also:
i915 platform: ICL
i915 features: GEM/Other



Description Martin Peres 2018-10-26 14:47:54 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4697/fi-icl-u/igt@drv_selftest@live_contexts.html

<3> [549.258788] i915/i915_gem_context_live_selftests: igt_ctx_readonly failed with error -5
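
For reference, the -5 in the log line above is a negative errno value: on Linux, errno 5 is EIO ("Input/output error"). A minimal C sketch for decoding such return values follows; `describe_kernel_error` is a hypothetical helper name used only for illustration, not an i915 or IGT function.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

/*
 * Hypothetical helper (not part of i915 or IGT): turn a kernel-style
 * return value (0 on success, negative errno on failure) into text.
 */
static const char *describe_kernel_error(int ret)
{
	assert(ret <= 0);	/* kernel convention: never a positive code */
	return ret == 0 ? "success" : strerror(-ret);
}
```

Calling `describe_kernel_error(-5)` returns the C library's message for EIO.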
Comment 1 Francesco Balestrieri 2018-10-30 09:29:32 UTC
Investigating a possible reason with https://patchwork.freedesktop.org/series/51703/
Comment 2 Chris Wilson 2018-10-30 09:33:34 UTC
(In reply to Francesco Balestrieri from comment #1)
> Investigating a possible reason with
> https://patchwork.freedesktop.org/series/51703/

That was for bug 108315. For this one, there isn't much we can do but declare that read-only support on icl is bust and not use it (at the cost of reduced uAPI and having to allocate separate 64k large pages for scratch and PD in every context).
Comment 3 Francesco Balestrieri 2018-10-30 09:50:39 UTC
Sorry for the confusion, too many copy-pastes. 

> This one there isn't much we can do but declare that read-only support on icl
> is bust and not use it (at the cost of reduced uAPI and having to allocate separate
> 64k large pages for scratch and PD in every context)

Are there patches that work around this issue? I saw scratch and 64K mentioned in some but I don't know if it's about this or something else.
Comment 4 Lakshmi 2018-11-05 16:27:07 UTC
Dropping the priority to High as this bug fix doesn't depend on the drm/i915 Intel GFX Driver.
Comment 5 Chris Wilson 2019-01-02 13:33:37 UTC
*** Bug 109210 has been marked as a duplicate of this bug. ***
Comment 6 CI Bug Log 2019-01-02 13:45:22 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ICL: igt@i915_selftest@live_contexts - dmesg-fail - igt_vm_isolation failed with error -5 (No new failures associated)
Comment 7 Chris Wilson 2019-01-04 16:35:56 UTC
*** Bug 109226 has been marked as a duplicate of this bug. ***
Comment 8 CI Bug Log 2019-01-14 17:12:12 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ICL: igt@i915_selftest@live_contexts - incomplete
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5416/fi-icl-u2/igt@i915_selftest@live_contexts.html
Comment 9 CI Bug Log 2019-01-16 09:13:14 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@i915_selftest@live_contexts - incomplete -}
{+ ICL: igt@i915_selftest@live_(hangcheck|contexts) - incomplete +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5428/fi-icl-u3/igt@i915_selftest@live_hangcheck.html
Comment 10 CI Bug Log 2019-01-21 16:19:16 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@i915_selftest@live_(hangcheck|contexts) - incomplete -}
{+ ICL: igt@i915_selftest@live_(hangcheck|contexts) / igt@gem_ctx_switch@basic-default - incomplete +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5452/shard-iclb7/igt@gem_ctx_switch@basic-default.html
Comment 11 CI Bug Log 2019-02-02 14:25:52 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@drv_selftest@live_contexts - dmesg-fail - igt_ctx_readonly failed with error -5 -}
{+ ICL: igt@drv_selftest@live_contexts - dmesg-fail - igt_ctx_readonly failed with error -5 +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5519/fi-icl-y/igt@i915_selftest@live_contexts.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5520/fi-icl-y/igt@i915_selftest@live_contexts.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5521/fi-icl-y/igt@i915_selftest@live_contexts.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5522/fi-icl-y/igt@i915_selftest@live_contexts.html
Comment 12 CI Bug Log 2019-02-02 14:26:24 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@i915_selftest@live_(hangcheck|contexts) / igt@gem_ctx_switch@basic-default - incomplete -}
{+ ICL: igt@i915_selftest@live_(hangcheck|contexts) / igt@gem_ctx_switch@basic-default - incomplete +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5521/fi-icl-y/igt@i915_selftest@live_hangcheck.html
Comment 13 Lakshmi 2019-02-14 16:53:37 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_218/fi-icl-u2/igt@gem_eio@reset-stress.html

<3> [132.315742] process_csb:1103 GEM_BUG_ON(!i915_request_completed(rq))
<2> [132.315843] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:1103!  


Mika/Chris, Can you confirm if this failure is related to this bug?
Comment 14 Chris Wilson 2019-02-14 17:02:58 UTC
(In reply to Lakshmi from comment #13)
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_218/fi-icl-u2/
> igt@gem_eio@reset-stress.html
> 
> <3> [132.315742] process_csb:1103 GEM_BUG_ON(!i915_request_completed(rq))
> <2> [132.315843] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:1103!  
> 
> 
> Mika/Chris, Can you confirm if this failure is related to this bug?

It doesn't follow the same sequence of events, so we can definitely rule out the earlier class of read-only hangs being involved here. That just looks like a more regular incomplete request following a reset.
Comment 15 Chris Wilson 2019-02-15 16:34:19 UTC
*** Bug 108342 has been marked as a duplicate of this bug. ***
Comment 17 Martin Peres 2019-04-04 07:09:59 UTC
Moving to highest as this is a simple way to cause a denial of service, which is not acceptable.
Comment 18 Martin Peres 2019-04-05 06:54:14 UTC
(In reply to Martin Peres from comment #17)
> Moving to highest as this is a simple way to cause a denial of service, which
> is not acceptable.

Let me rephrase that:
 - We don't know if any userspace driver uses RO pages
 - No patches to disable RO support on ICL have been sent
 - Leaving the feature available leads to a denial of service when userspace is not careful enough to avoid writing to a RO page

This is why the priority is set to highest.
Comment 19 Lakshmi 2019-04-09 12:09:17 UTC
Impact to users
---------------

Since BDW the HW has had support for read-only pages in the PPGTT. This is used internally by the driver to share scratch pages between objects, saving memory, and is exposed to UMDs via the UserPtr API.

Before ICL, writing to a read-only page was silently dropped. In ICL, it hangs the GPU. There are a few ways this can affect users:

1) If there is a bug in either the driver or userspace that mistakenly tries to write to a read-only page, we'll get a hang.

2) If userspace relies on the pre-ICL behavior and writes to a read-only page assuming nothing will happen (like the test in question does), it gets a hang instead.

Considering that this bug has been present for months and hasn't been reported by UMDs, it is reasonable to assume that the impact of 1 and 2 is low. The OCL team confirmed they don't use this feature, and we are not aware of media or Mesa using it, although that still needs to be confirmed. This is however something we should fix or prevent to avoid surprises.
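
The behavior difference described above can be summarized in a small C model. This is only a sketch of the description in this comment, not driver code: the enum, the function name and the use of a plain integer for the hardware generation are all assumptions made for illustration, and the model covers only gen8 (BDW) through gen11 (ICL).

```c
#include <assert.h>
#include <stdbool.h>

/* Possible results of a GPU write, as described in this bug. */
enum write_outcome {
	WRITE_LANDS,		/* page is writable, write succeeds */
	WRITE_DROPPED,		/* pre-ICL: write silently discarded */
	WRITE_HANGS_GPU,	/* ICL: write to a read-only page hangs */
};

/* Hypothetical model; gen is the hardware generation (8 = BDW, 11 = ICL). */
static enum write_outcome model_ppgtt_write(int gen, bool page_read_only)
{
	/* Only the generations discussed in this bug are modeled. */
	assert(gen >= 8 && gen <= 11);

	if (!page_read_only)
		return WRITE_LANDS;

	return gen == 11 ? WRITE_HANGS_GPU : WRITE_DROPPED;
}
```

In this model, cases 1 and 2 above are both calls with `page_read_only` set: harmless before ICL, a hang on ICL.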


Way forward
-----------

As a workaround, it is possible to disable read-only page support for ICL in the driver. We will lose the ability to share scratch pages between objects, requiring a 64k page allocation every time, which causes memory waste and fragmentation. We will also be sporadically unable to use hugepages in the GPU and will need to handle the userspace, which is possible but likely to introduce new bugs (more details should be asked from Wilson, Chris P whose explanation I'm paraphrasing).

Implementing the above workaround is a matter of a few days plus leaving enough time to get some extensive CI runs. However, we are investigating other options first.
Comment 20 Martin Peres 2019-04-11 11:44:36 UTC
The workaround has been sent to the mailing list: https://patchwork.freedesktop.org/series/59323/
Comment 21 Chris Wilson 2019-04-11 21:23:19 UTC
commit 3936867dbc1eb8790aa5985a68d53e4303b3616f
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date:   Thu Apr 11 11:30:34 2019 +0300

    drm/i915: Disable read only ppgtt support for gen11
    
    On gen11 writing to read only ppgtt page causes a gpu hang.
    This behaviour is different than with previous gen where
    read only ppgtt access is supported. On those, the write
    is just dropped without visible side effects.
    
    Disable ro ppgtt support on gen11 until a solution can
    be found to bring it into line with its predecessors.
    
    References: HSDES#1807136187
    References: https://bugzilla.freedesktop.org/show_bug.cgi?id=108569
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190411083034.28311-1-mika.kuoppala@linux.intel.com

