Summary: | [CI][BAT][iommu]igt@gem_exec_suspend@basic-s4-devices - fail - DMAR write fault 7 + Failed assertion: !"GPU hung" | | |
---|---|---|---|
Product: | DRI | Reporter: | Lakshmi <lakshminarayana.vudum> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | RESOLVED MOVED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | major | | |
Priority: | medium | CC: | intel-gfx-bugs |
Version: | DRI git | | |
Hardware: | Other | | |
OS: | All | | |
Whiteboard: | | | |
i915 platform: | ICL | i915 features: | GPU hang, power/runtime PM |
Description
Lakshmi
2019-09-16 12:03:07 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ICL: igt@gem_exec_suspend@basic-s4-devices - fail - Failed assertion: !"GPU hung"
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6893/fi-icl-u3/igt@gem_exec_suspend@basic-s4-devices.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_5020/fi-icl-u4/igt@gem_exec_suspend@basic-s4-devices.html

Correlates with

<3> [90.027305] DMAR: DRHD: handling fault status reg 2
<3> [90.027363] DMAR: [DMA Write] Request device [00:02.0] fault addr 43000 [fault reason 07] Next page table ptr is invalid

In the GPU hang, it dies on MI_BATCH_BUFFER_END and doesn't return execution to the ring. I presume that's when the lookup failed.

DMA_FADDR: 0x00000000_00013870

didn't cross a page so what it was looking up that failed is a mystery.

"fault addr 43000" looks to be a reference to BBADDR: 0x00000000_00042fd4
But write? gem_exec_suspend does use a scratch page to verify HW works across suspend (using MI_STORE_DWORD_IMM), so probably that's the page that is absent.

It's worth pointing out the failure was before the suspend; so do we have a coherency issue with the dma-mapping?

A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@gem_exec_suspend@basic-s4-devices - fail - Failed assertion: !"GPU hung" -}
{+ ICL: igt@gem_exec_suspend@basic-s4-devices - fail - Failed assertion: !"GPU hung" +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6903/shard-iclb6/igt@gem_exec_suspend@basic.html

(In reply to Chris Wilson from comment #2)
> Correlates with
>
> <3> [90.027305] DMAR: DRHD: handling fault status reg 2
> <3> [90.027363] DMAR: [DMA Write] Request device [00:02.0] fault addr 43000
> [fault reason 07] Next page table ptr is invalid
>
> In the GPU hang, it dies on MI_BATCH_BUFFER_END and doesn't return execution
> to the ring. I presume that's when the lookup failed.
>
> DMA_FADDR: 0x00000000_00013870
>
> didn't cross a page so what it was looking up that failed is a mystery.
>
> "fault addr 43000" looks to be a reference to BBADDR: 0x00000000_00042fd4
> But write? gem_exec_suspend does use a scratch page to verify HW works
> across suspend (using MI_STORE_DWORD_IMM), so probably that's the page that
> is absent.

(In reply to CI Bug Log from comment #4)
> A CI Bug Log filter associated to this bug has been updated:
>
> {- ICL: igt@gem_exec_suspend@basic-s4-devices - fail - Failed assertion:
> !"GPU hung" -}
> {+ ICL: igt@gem_exec_suspend@basic-s4-devices - fail - Failed assertion:
> !"GPU hung" +}
>
> New failures caught by the filter:
>
> *
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6903/shard-iclb6/
> igt@gem_exec_suspend@basic.html

One more instance, but on shards.

<3> [209.832569] DMAR: DRHD: handling fault status reg 2
<3> [209.832629] DMAR: [DMA Write] Request device [00:02.0] fault addr 41000 [fault reason 07] Next page table ptr is invalid

DMA_FADDR: 0x00000000_007f8b78

A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@gem_exec_suspend@basic-s4-devices - fail - Failed assertion: !"GPU hung" -}
{+ ICL: igt@gem_exec_suspend@basic-s4-devices|igt@gem_exec_suspend@basic - fail - Failed assertion: !"GPU hung" +}

No new failures caught with the new filter

<3> [209.832569] DMAR: DRHD: handling fault status reg 2
<3> [209.832629] DMAR: [DMA Write] Request device [00:02.0] fault addr 41000 [fault reason 07] Next page table ptr is invalid
<7> [212.694638] [drm:edp_panel_vdd_off_sync [i915]] Turning [ENCODER:214:DDI A] VDD off
<7> [212.694912] [drm:edp_panel_vdd_off_sync [i915]] PP_STATUS: 0x80000008 PP_CONTROL: 0x00000067
<7> [212.694994] [drm:intel_power_well_disable [i915]] disabling DC off
<7> [212.695078] [drm:skl_enable_dc6 [i915]] Enabling DC6
<7> [212.695165] [drm:gen9_set_dc_state [i915]] Setting DC state from 00 to 02
<7> [217.942713] hangcheck bcs0
<7> [217.942719] hangcheck Awake? 2
<7> [217.942723] hangcheck Hangcheck: 6016 ms ago
<7> [217.942727] hangcheck Reset count: 0 (global 740)
<7> [217.942730] hangcheck Requests:
<7> [217.942743] hangcheck active 617:1e0*- prio=3 @ 8110ms: gem_exec_suspen[2283]
<7> [217.942747] hangcheck ring->start: 0x007f6000
<7> [217.942750] hangcheck ring->head: 0x00002ae0
<7> [217.942754] hangcheck ring->tail: 0x00002b78
<7> [217.942757] hangcheck ring->emit: 0x000037c0
<7> [217.942760] hangcheck ring->space: 0x000032e0
<7> [217.942763] hangcheck ring->hwsp: 0xffffa180
<7> [217.942767] hangcheck [head 2b10, postfix 2b50, tail 2b80, batch 0x00000000_00040000]:
<7> [217.942789] hangcheck [0000] 13244002 00000204 00000000 00000000 02800000 00000000 10400002 ffffa180
<7> [217.942794] hangcheck [0020] 00000000 000001df 04000001 18800101 00040000 00000000 04000000 00000000
<7> [217.942798] hangcheck [0040] 13004002 ffffa184 00000000 000001e0 01000000 04000001 0e40c002 00000000
<7> [217.942802] hangcheck [0060] ffffd0c8 00000000 02800000 00000000
<7> [217.942812] hangcheck MMIO base: 0x00022000
<7> [217.942824] hangcheck RING_START: 0x007f6000
<7> [217.942830] hangcheck RING_HEAD: 0x00002b48
<7> [217.942837] hangcheck RING_TAIL: 0x00002b78
<7> [217.942846] hangcheck RING_CTL: 0x00003001
<7> [217.942856] hangcheck RING_MODE: 0x00000000
<7> [217.942863] hangcheck RING_IMR: 00000000
<7> [217.942882] hangcheck ACTHD: 0x00000000_00202b48
<7> [217.942895] hangcheck BBADDR: 0x00000000_00040fd4
<7> [217.942908] hangcheck DMA_FADDR: 0x00000000_007f8b78
<7> [217.942915] hangcheck IPEIR: 0x00000000
<7> [217.942921] hangcheck IPEHR: 0x05000000
<7> [217.942932] hangcheck Execlist status: 0x00001098 60000020, entries 12
<7> [217.942936] hangcheck Execlist CSB read 3, write 3, tasklet queued? no (enabled)
<7> [217.942943] hangcheck Active[0: ring:{start:007f6000, hwsp:ffffa180, seqno:000001df}, rq: 617:1e0*- prio=3 @ 8110ms: gem_exec_suspen[2283]
<7> [217.942950] hangcheck E 617:1e0*- prio=3 @ 8110ms: gem_exec_suspen[2283]
<7> [217.942953] hangcheck HWSP:
<7> [217.942958] hangcheck [0000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7> [217.942962] hangcheck *
<7> [217.942967] hangcheck [0040] 00000018 60000020 00000001 60000000 00000018 60000020 00000001 60000000
<7> [217.942970] hangcheck *
<7> [217.942974] hangcheck [00a0] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000003
<7> [217.942979] hangcheck [00c0] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7> [217.942982] hangcheck *
<7> [217.942989] hangcheck Idle? no

So the fault addr of 0x41000 matches the page after the batch (BBADDR: 0x40fd4). The write is puzzling. The BBADDR is close enough to the page boundary for the 128-byte prefetch to cross into the next page, but it should not be a write for the CS parser. And it should happily be a scratch page, or the store buffer.

A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@gem_exec_suspend@basic-s4-devices|igt@gem_exec_suspend@basic - fail - Failed assertion: !"GPU hung" -}
{+ ICL: igt@gem_exec_suspend@basic-s4-devices|igt@gem_exec_suspend@basic - fail - Failed assertion: !"GPU hung" +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6996/shard-iclb2/igt@gem_exec_suspend@basic-s3.html

Happens with some frequency, setting to high/major.

Worthy of note that we've only seen "DMAR write fault 7" on icl (afaict) -- possibly HW specific?

Still occurs, but the incidence is not very high (3.5%). With that and the HW specific comment above, I'm lowering to medium.

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/423.
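
For anyone re-checking the address arithmetic from the comments in this thread, here is a minimal sketch of it. It assumes 4 KiB pages and models the 128-byte prefetch as a window starting at BBADDR. Both are simplifications made for illustration, not statements about how the command streamer actually fetches. The helper names are hypothetical. The two sample pairs are the BBADDR and DMAR fault addr values quoted in the logs.

```python
PAGE_SIZE = 4096       # assumption: 4 KiB pages
PREFETCH_BYTES = 128   # the 128-byte prefetch mentioned in the thread


def page_base(addr, page_size=PAGE_SIZE):
    """Round an address down to the start of its page."""
    return addr & ~(page_size - 1)


def prefetch_crosses_page(bbaddr, prefetch=PREFETCH_BYTES, page_size=PAGE_SIZE):
    """True if the window [bbaddr, bbaddr + prefetch) spills into the next page."""
    last_byte = bbaddr + prefetch - 1
    return page_base(last_byte, page_size) != page_base(bbaddr, page_size)


def fault_is_next_page(bbaddr, fault_addr, page_size=PAGE_SIZE):
    """True if fault_addr is the base of the page right after bbaddr's page."""
    return fault_addr == page_base(bbaddr, page_size) + page_size


# (BBADDR, DMAR fault addr) pairs taken from the two reports in this bug.
SAMPLES = [
    (0x42FD4, 0x43000),  # fi-icl-u3, CI_DRM_6893
    (0x40FD4, 0x41000),  # shard-iclb6, CI_DRM_6903
]

for bbaddr, fault in SAMPLES:
    print(
        f"BBADDR {bbaddr:#x}: page {page_base(bbaddr):#x}, "
        f"prefetch crosses page: {prefetch_crosses_page(bbaddr)}, "
        f"fault {fault:#x} is next page: {fault_is_next_page(bbaddr, fault)}"
    )
```

In both samples BBADDR sits 0x2c bytes before the end of its page, so under these assumptions the prefetch window reaches the faulting page. This only restates the read-side correlation; it does not explain why the IOMMU reported a write.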