Summary: | [CI][BAT icl] igt@drv_selftest@live_contexts - dmesg-fail - igt_ctx_readonly failed with error -5 HSDES#:1807136187 | ||
---|---|---|---|
Product: | DRI | Reporter: | Martin Peres <martin.peres> |
Component: | DRM/Intel | Assignee: | Mika Kuoppala <mika.kuoppala> |
Status: | RESOLVED NOTOURBUG | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | highest | CC: | chris, intel-gfx-bugs, rakesh.riddickt789 |
Version: | XOrg git | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | ReadyForDev | ||
i915 platform: | ICL | i915 features: | GEM/Other |
Description
Martin Peres
2018-10-26 14:47:54 UTC
Investigating a possible reason with https://patchwork.freedesktop.org/series/51703/

(In reply to Francesco Balestrieri from comment #1)
> Investigating a possible reason with
> https://patchwork.freedesktop.org/series/51703/

That was for bug 108315. This one there isn't much we can do but declare that read-only support on icl is bust and not use it (at the cost of reduced uAPI and having to allocate separate 64k large pages for scratch and PD in every context).

Sorry for the confusion, too many copy-pastes.
> This one there isn't much we can do but declare that read-only support on icl
> is bust and not use it (at the cost of reduced uAPI and having to allocate separate
> 64k large pages for scratch and PD in every context)
Are there patches that work around this issue? I saw scratch and 64K mentioned in some but I don't know if it's about this or something else.
Dropping the priority to High as this bug fix doesn't depend on the drm/i915 Intel GFX Driver.

*** Bug 109210 has been marked as a duplicate of this bug. ***

The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ICL: igt@i915_selftest@live_contexts - dmesg-fail - igt_vm_isolation failed with error -5
  (No new failures associated)

*** Bug 109226 has been marked as a duplicate of this bug. ***

The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ICL: igt@i915_selftest@live_contexts - incomplete
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5416/fi-icl-u2/igt@i915_selftest@live_contexts.html

A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@i915_selftest@live_contexts - incomplete -}
{+ ICL: igt@i915_selftest@live_(hangcheck|contexts) - incomplete +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5428/fi-icl-u3/igt@i915_selftest@live_hangcheck.html

A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@i915_selftest@live_(hangcheck|contexts) - incomplete -}
{+ ICL: igt@i915_selftest@live_(hangcheck|contexts) / igt@gem_ctx_switch@basic-default - incomplete +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5452/shard-iclb7/igt@gem_ctx_switch@basic-default.html

A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@drv_selftest@live_contexts - dmesg-fail - igt_ctx_readonly failed with error -5 -}
{+ ICL: igt@drv_selftest@live_contexts - dmesg-fail - igt_ctx_readonly failed with error -5 +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5519/fi-icl-y/igt@i915_selftest@live_contexts.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5520/fi-icl-y/igt@i915_selftest@live_contexts.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5521/fi-icl-y/igt@i915_selftest@live_contexts.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5522/fi-icl-y/igt@i915_selftest@live_contexts.html

A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@i915_selftest@live_(hangcheck|contexts) / igt@gem_ctx_switch@basic-default - incomplete -}
{+ ICL: igt@i915_selftest@live_(hangcheck|contexts) / igt@gem_ctx_switch@basic-default - incomplete +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5521/fi-icl-y/igt@i915_selftest@live_hangcheck.html

https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_218/fi-icl-u2/igt@gem_eio@reset-stress.html

<3> [132.315742] process_csb:1103 GEM_BUG_ON(!i915_request_completed(rq))
<2> [132.315843] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:1103!

Mika/Chris, Can you confirm if this failure is related to this bug?

(In reply to Lakshmi from comment #13)
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_218/fi-icl-u2/
> igt@gem_eio@reset-stress.html
>
> <3> [132.315742] process_csb:1103 GEM_BUG_ON(!i915_request_completed(rq))
> <2> [132.315843] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:1103!
>
> Mika/Chris, Can you confirm if this failure is related to this bug?

It doesn't follow the same sequence of events, so we can definitely rule out the earlier class of read-only hangs being involved here. That just looks like a more regular incomplete request following a reset.

*** Bug 108342 has been marked as a duplicate of this bug. ***

The CI Bug Log issue associated to this bug has been updated.
### New filters associated

* ICL: igt@runner@aborted - fail - Previous test: i915_selftest (live_hangcheck)
  - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_12138/fi-icl-u3/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3647/fi-kbl-r/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_12151/fi-icl-u3/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_12215/shard-iclb1/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_12226/fi-icl-y/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3755/fi-icl-u2/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3755/fi-icl-u3/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3755/fi-kbl-7560u/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3755/fi-kbl-r/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_3755/fi-whl-u/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4840/shard-iclb5/igt@runner@aborted.html

Moving to highest as this is a simple way to make denial of services, which is not acceptable.

(In reply to Martin Peres from comment #17)
> Moving to highest as this is a simple way to make denial of services, which
> is not acceptable.

Let me rephrase that:
- We don't know if any userspace driver uses RO pages
- No patches to disable RO support on ICL have been sent
- Leaving the feature available leads to a denial of service when the userspace is not careful enough not to write to a RO page

This is why the priority is set to highest.

Impact to users
---------------

Since BDW the HW has had support for read-only pages in the PPGTT. This is used internally by the driver to share scratch pages between objects, saving memory, and is exposed to UMDs via the UserPtr API. Before ICL, a write to a read-only page was silently dropped. On ICL, it hangs the GPU.
There are a few ways this can affect users:

1) If a bug in either the driver or userspace mistakenly tries to write to a read-only page, we'll have a hang.
2) If userspace relies on the pre-ICL behavior and decides to write to a read-only page assuming nothing will happen (like the test in question does), it gets a hang instead.

Considering that this bug has been present for months and hasn't been reported by UMDs, it is reasonable to assume that the impact of 1 and 2 is low. The OCL team confirmed they don't use this feature, and we are not aware of media and Mesa doing it, although that needs to be confirmed. This is however something we should fix or prevent to avoid surprises.

Way forward
-----------

As a workaround, it is possible to disable read-only page support for ICL in the driver. We will lose the ability to share scratch pages between objects, requiring a 64k page allocation every time, which causes memory waste and fragmentation. We will also be sporadically unable to use hugepages in the GPU and will need to handle the userspace, which is possible but likely to introduce new bugs (more details should be asked from Wilson, Chris P, whose explanation I'm paraphrasing).

Implementing the above workaround is a matter of a few days, plus enough time to get some extensive CI runs. However, we are investigating other options first.

The workaround has been sent to the mailing list: https://patchwork.freedesktop.org/series/59323/

commit 3936867dbc1eb8790aa5985a68d53e4303b3616f
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date: Thu Apr 11 11:30:34 2019 +0300

drm/i915: Disable read only ppgtt support for gen11

On gen11 writing to read only ppgtt page causes a gpu hang. This behaviour is different than with previous gen where read only ppgtt access is supported. On those, the write is just dropped without visible side effects.

Disable ro ppgtt support on gen11 until a solution can be found to bring it into line with its predecessors.
References: HSDES#1807136187
References: https://bugzilla.freedesktop.org/show_bug.cgi?id=108569
Cc: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Acked-by: Chris Wilson <chris@chris-wilson.co.uk>
Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
Link: https://patchwork.freedesktop.org/patch/msgid/20190411083034.28311-1-mika.kuoppala@linux.intel.com
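
For readers who want the reasoning behind the workaround in compact form, here is a minimal Python model of the behaviour described in this thread. It is only a sketch, not the driver code: the function names, the `Outcome` enum and the `workaround_applied` flag are all hypothetical. It assumes BDW corresponds to gen8 and ICL to gen11, and it only covers gen8 to gen11 because the thread says nothing about other generations.

```python
from enum import Enum


class Outcome(Enum):
    DROPPED = "write silently dropped"
    GPU_HANG = "GPU hang"


# Generations this thread actually talks about: BDW (gen8) up to ICL (gen11).
COVERED_GENS = range(8, 12)


def _check_gen(gen: int) -> None:
    """Refuse to guess about generations the thread does not describe."""
    if gen not in COVERED_GENS:
        raise ValueError(f"gen{gen} is outside the range discussed (gen8..gen11)")


def hw_ro_write_outcome(gen: int) -> Outcome:
    """What the hardware does with a write to a read-only PPGTT page, per the thread."""
    _check_gen(gen)
    return Outcome.GPU_HANG if gen == 11 else Outcome.DROPPED


def driver_allows_ro(gen: int, workaround_applied: bool) -> bool:
    """Hypothetical driver-side gate: with the workaround, gen11 never uses read-only pages."""
    _check_gen(gen)
    return not (gen == 11 and workaround_applied)


def can_hang_via_ro_write(gen: int, workaround_applied: bool) -> bool:
    """A stray write can hang the GPU only if read-only pages are in use and the HW hangs on them."""
    return (driver_allows_ro(gen, workaround_applied)
            and hw_ro_write_outcome(gen) is Outcome.GPU_HANG)
```

In this model, `can_hang_via_ro_write(11, False)` is true and `can_hang_via_ro_write(11, True)` is false, which is the point of the commit above: the hang is avoided by never using read-only pages on gen11. The costs listed under "Way forward" (no shared scratch pages, reduced uAPI) are not modelled.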