Bug 111699

Summary: [CI][BAT][iommu]igt@gem_exec_suspend@basic-s4-devices - fail - DMAR write fault 7 + Failed assertion: !"GPU hung"
Product: DRI Reporter: Lakshmi <lakshminarayana.vudum>
Component: DRM/Intel Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: RESOLVED MOVED QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: major    
Priority: medium CC: intel-gfx-bugs
Version: DRI git   
Hardware: Other   
OS: All   
Whiteboard:
i915 platform: ICL i915 features: GPU hang, power/runtime PM

Description Lakshmi 2019-09-16 12:03:07 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6893/fi-icl-u3/igt@gem_exec_suspend@basic-s4-devices.html

Starting subtest: basic-S4-devices
(gem_exec_suspend:2432) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:502:
(gem_exec_suspend:2432) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest basic-S4-devices failed.
Comment 1 CI Bug Log 2019-09-16 12:05:27 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ICL: igt@gem_exec_suspend@basic-s4-devices - fail - Failed assertion: !"GPU hung"
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6893/fi-icl-u3/igt@gem_exec_suspend@basic-s4-devices.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_5020/fi-icl-u4/igt@gem_exec_suspend@basic-s4-devices.html
Comment 2 Chris Wilson 2019-09-16 12:13:47 UTC
Correlates with

<3> [90.027305] DMAR: DRHD: handling fault status reg 2
<3> [90.027363] DMAR: [DMA Write] Request device [00:02.0] fault addr 43000 [fault reason 07] Next page table ptr is invalid

In the GPU hang, it dies on MI_BATCH_BUFFER_END and doesn't return execution to the ring. I presume that's when the lookup failed.

DMA_FADDR: 0x00000000_00013870

didn't cross a page so what it was looking up that failed is a mystery.

"fault addr 43000" looks to be a reference to BBADDR: 0x00000000_00042fd4
But write? gem_exec_suspend does use a scratch page to verify HW works across suspend (using MI_STORE_DWORD_IMM), so probably that's the page that is absent.
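The page arithmetic behind that guess can be sketched as follows. This is a minimal illustration, assuming 4 KiB pages; `page_base` and `relate_fault_to_batch` are hypothetical helper names, not part of i915 or IGT.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def page_base(addr):
    """Round an address down to the start of its page."""
    return addr & ~(PAGE_SIZE - 1)


def relate_fault_to_batch(bbaddr, fault_addr):
    """Describe where the faulting page sits relative to the page holding BBADDR."""
    delta_pages = (page_base(fault_addr) - page_base(bbaddr)) // PAGE_SIZE
    if delta_pages == 0:
        return "same page"
    if delta_pages == 1:
        return "next page"
    return f"{delta_pages:+d} pages away"


# Values quoted in this comment.
print(hex(page_base(0x42FD4)))                 # page holding BBADDR
print(relate_fault_to_batch(0x42FD4, 0x43000))
```

With the values above, BBADDR sits in the page at 0x42000 and the reported fault address is the start of the page immediately after it.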
Comment 3 Chris Wilson 2019-09-16 12:16:25 UTC
It's worth pointing out the failure was before the suspend; so do we have a coherency issue with the dma-mapping?
Comment 4 CI Bug Log 2019-09-17 09:25:56 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@gem_exec_suspend@basic-s4-devices - fail - Failed assertion: !"GPU hung" -}
{+ ICL: igt@gem_exec_suspend@basic-s4-devices - fail - Failed assertion: !"GPU hung" +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6903/shard-iclb6/igt@gem_exec_suspend@basic.html
Comment 5 Lakshmi 2019-09-17 09:29:55 UTC
(In reply to Chris Wilson from comment #2)
> Correlates with
> 
> <3> [90.027305] DMAR: DRHD: handling fault status reg 2
> <3> [90.027363] DMAR: [DMA Write] Request device [00:02.0] fault addr 43000
> [fault reason 07] Next page table ptr is invalid
> 
> In the GPU hang, it dies on MI_BATCH_BUFFER_END and doesn't return execution
> to the ring. I presume that's when the lookup failed.
> 
> DMA_FADDR: 0x00000000_00013870
> 
> didn't cross a page so what it was looking up that failed is a mystery.
> 
> "fault addr 43000" looks to be a reference to BBADDR: 0x00000000_00042fd4
> But write? gem_exec_suspend does use a scratch page to verify HW works
> across suspend (using MI_STORE_DWORD_IMM), so probably that's the page that
> is absent.

(In reply to CI Bug Log from comment #4)
> A CI Bug Log filter associated to this bug has been updated:
> 
> {- ICL: igt@gem_exec_suspend@basic-s4-devices - fail - Failed assertion:
> !"GPU hung" -}
> {+ ICL: igt@gem_exec_suspend@basic-s4-devices - fail - Failed assertion:
> !"GPU hung" +}
> 
> New failures caught by the filter:
> 
>   *
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6903/shard-iclb6/
> igt@gem_exec_suspend@basic.html

One more instance, but on shards.
<3> [209.832569] DMAR: DRHD: handling fault status reg 2
<3> [209.832629] DMAR: [DMA Write] Request device [00:02.0] fault addr 41000 [fault reason 07] Next page table ptr is invalid

DMA_FADDR: 0x00000000_007f8b78
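
For comparing instances like these across runs, the DMAR fault line can be picked apart with a small parser. This is only a sketch: the regular expression is derived from the two log lines quoted in this bug, the fault reason is assumed to be printed in decimal, and `parse_dmar_fault` is a hypothetical helper name.

```python
import re

# Pattern derived from the sample lines in this report (assumption: other
# kernels may format the message differently).
DMAR_FAULT_RE = re.compile(
    r"DMAR: \[DMA (?P<op>Read|Write)\] Request device \[(?P<device>[0-9a-f:.]+)\] "
    r"fault addr (?P<addr>[0-9a-f]+) \[fault reason (?P<reason>\d+)\] (?P<text>.*)"
)


def parse_dmar_fault(line):
    """Return the fields of a DMAR fault line as a dict, or None if it does not match."""
    m = DMAR_FAULT_RE.search(line)
    if m is None:
        return None
    return {
        "op": m["op"],
        "device": m["device"],
        "addr": int(m["addr"], 16),   # fault addr is logged as hex without a 0x prefix
        "reason": int(m["reason"]),   # assumed decimal, e.g. "07"
        "text": m["text"].strip(),
    }


sample = ("<3> [209.832629] DMAR: [DMA Write] Request device [00:02.0] "
          "fault addr 41000 [fault reason 07] Next page table ptr is invalid")
print(parse_dmar_fault(sample))
```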
Comment 6 CI Bug Log 2019-09-17 09:30:22 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@gem_exec_suspend@basic-s4-devices - fail - Failed assertion: !"GPU hung" -}
{+ ICL: igt@gem_exec_suspend@basic-s4-devices|igt@gem_exec_suspend@basic - fail - Failed assertion: !"GPU hung" +}


  No new failures caught with the new filter
Comment 7 Chris Wilson 2019-09-17 10:02:41 UTC
<3> [209.832569] DMAR: DRHD: handling fault status reg 2
<3> [209.832629] DMAR: [DMA Write] Request device [00:02.0] fault addr 41000 [fault reason 07] Next page table ptr is invalid
<7> [212.694638] [drm:edp_panel_vdd_off_sync [i915]] Turning [ENCODER:214:DDI A] VDD off
<7> [212.694912] [drm:edp_panel_vdd_off_sync [i915]] PP_STATUS: 0x80000008 PP_CONTROL: 0x00000067
<7> [212.694994] [drm:intel_power_well_disable [i915]] disabling DC off
<7> [212.695078] [drm:skl_enable_dc6 [i915]] Enabling DC6
<7> [212.695165] [drm:gen9_set_dc_state [i915]] Setting DC state from 00 to 02
<7> [217.942713] hangcheck bcs0
<7> [217.942719] hangcheck 	Awake? 2
<7> [217.942723] hangcheck 	Hangcheck: 6016 ms ago
<7> [217.942727] hangcheck 	Reset count: 0 (global 740)
<7> [217.942730] hangcheck 	Requests:
<7> [217.942743] hangcheck 		active  617:1e0*-  prio=3 @ 8110ms: gem_exec_suspen[2283]
<7> [217.942747] hangcheck 		ring->start:  0x007f6000
<7> [217.942750] hangcheck 		ring->head:   0x00002ae0
<7> [217.942754] hangcheck 		ring->tail:   0x00002b78
<7> [217.942757] hangcheck 		ring->emit:   0x000037c0
<7> [217.942760] hangcheck 		ring->space:  0x000032e0
<7> [217.942763] hangcheck 		ring->hwsp:   0xffffa180
<7> [217.942767] hangcheck [head 2b10, postfix 2b50, tail 2b80, batch 0x00000000_00040000]:
<7> [217.942789] hangcheck [0000] 13244002 00000204 00000000 00000000 02800000 00000000 10400002 ffffa180
<7> [217.942794] hangcheck [0020] 00000000 000001df 04000001 18800101 00040000 00000000 04000000 00000000
<7> [217.942798] hangcheck [0040] 13004002 ffffa184 00000000 000001e0 01000000 04000001 0e40c002 00000000
<7> [217.942802] hangcheck [0060] ffffd0c8 00000000 02800000 00000000
<7> [217.942812] hangcheck 	MMIO base:  0x00022000
<7> [217.942824] hangcheck 	RING_START: 0x007f6000
<7> [217.942830] hangcheck 	RING_HEAD:  0x00002b48
<7> [217.942837] hangcheck 	RING_TAIL:  0x00002b78
<7> [217.942846] hangcheck 	RING_CTL:   0x00003001
<7> [217.942856] hangcheck 	RING_MODE:  0x00000000
<7> [217.942863] hangcheck 	RING_IMR: 00000000
<7> [217.942882] hangcheck 	ACTHD:  0x00000000_00202b48
<7> [217.942895] hangcheck 	BBADDR: 0x00000000_00040fd4
<7> [217.942908] hangcheck 	DMA_FADDR: 0x00000000_007f8b78
<7> [217.942915] hangcheck 	IPEIR: 0x00000000
<7> [217.942921] hangcheck 	IPEHR: 0x05000000
<7> [217.942932] hangcheck 	Execlist status: 0x00001098 60000020, entries 12
<7> [217.942936] hangcheck 	Execlist CSB read 3, write 3, tasklet queued? no (enabled)
<7> [217.942943] hangcheck 		Active[0: ring:{start:007f6000, hwsp:ffffa180, seqno:000001df}, rq:  617:1e0*-  prio=3 @ 8110ms: gem_exec_suspen[2283]
<7> [217.942950] hangcheck 		E  617:1e0*-  prio=3 @ 8110ms: gem_exec_suspen[2283]
<7> [217.942953] hangcheck HWSP:
<7> [217.942958] hangcheck [0000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7> [217.942962] hangcheck *
<7> [217.942967] hangcheck [0040] 00000018 60000020 00000001 60000000 00000018 60000020 00000001 60000000
<7> [217.942970] hangcheck *
<7> [217.942974] hangcheck [00a0] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000003
<7> [217.942979] hangcheck [00c0] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7> [217.942982] hangcheck *
<7> [217.942989] hangcheck Idle? no

So the fault addr of 0x41000 matches the page after the batch (BBADDR: 0x40fd4). The write is puzzling. The BBADDR is close enough to the page boundary for the 128-byte prefetch to cross into the next page, but it should not be a write for the CS parser. And it should happily be a scratch page, or the store buffer.
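
The prefetch argument can be checked with a quick calculation. This is a rough sketch, assuming 4 KiB pages and a prefetch window that starts at BBADDR itself (the real alignment of the CS prefetch is not stated here); the function names are hypothetical.

```python
PAGE_SIZE = 4096       # assumption: 4 KiB pages
PREFETCH_BYTES = 128   # prefetch size mentioned in this comment


def bytes_left_in_page(addr):
    """Number of bytes from addr to the end of its page."""
    return PAGE_SIZE - (addr % PAGE_SIZE)


def prefetch_crosses_page(addr, window=PREFETCH_BYTES):
    """True if a window of `window` bytes starting at addr reaches into the next page."""
    return window > bytes_left_in_page(addr)


bbaddr = 0x40FD4
print(bytes_left_in_page(bbaddr))      # bytes remaining before 0x41000
print(prefetch_crosses_page(bbaddr))
```

Only 44 bytes remain between 0x40fd4 and the page boundary, so a 128-byte window started there would touch 0x41000. That accounts for an access to the next page, though not for it being reported as a write.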
Comment 8 CI Bug Log 2019-10-04 07:33:10 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@gem_exec_suspend@basic-s4-devices|igt@gem_exec_suspend@basic - fail - Failed assertion: !"GPU hung" -}
{+ ICL: igt@gem_exec_suspend@basic-s4-devices|igt@gem_exec_suspend@basic - fail - Failed assertion: !"GPU hung" +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6996/shard-iclb2/igt@gem_exec_suspend@basic-s3.html
Comment 9 Francesco Balestrieri 2019-10-04 11:13:34 UTC
Happens with some frequency, setting to high/major.
Comment 10 Chris Wilson 2019-10-04 19:26:30 UTC
Worthy of note that we've only seen "DMAR write fault 7" on icl (afaict) -- possibly HW specific?
Comment 11 Francesco Balestrieri 2019-11-11 09:59:26 UTC
Still occurs, but the incidence is not very high (3.5%). With that and the HW specific comment above, I'm lowering to medium.
Comment 12 Martin Peres 2019-11-29 19:28:48 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/423.
