Bug 111699 - [CI][BAT][iommu]igt@gem_exec_suspend@basic-s4-devices - fail - DMAR write fault 7 + Failed assertion: !"GPU hung"
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: high major
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2019-09-16 12:03 UTC by Lakshmi
Modified: 2019-10-04 19:26 UTC

See Also:
i915 platform: ICL
i915 features: GPU hang


Description Lakshmi 2019-09-16 12:03:07 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6893/fi-icl-u3/igt@gem_exec_suspend@basic-s4-devices.html

Starting subtest: basic-S4-devices
(gem_exec_suspend:2432) igt_aux-CRITICAL: Test assertion failure function sig_abort, file ../lib/igt_aux.c:502:
(gem_exec_suspend:2432) igt_aux-CRITICAL: Failed assertion: !"GPU hung"
Subtest basic-S4-devices failed.
Comment 1 CI Bug Log 2019-09-16 12:05:27 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ICL: igt@gem_exec_suspend@basic-s4-devices - fail - Failed assertion: !"GPU hung"
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6893/fi-icl-u3/igt@gem_exec_suspend@basic-s4-devices.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Trybot_5020/fi-icl-u4/igt@gem_exec_suspend@basic-s4-devices.html
Comment 2 Chris Wilson 2019-09-16 12:13:47 UTC
Correlates with

<3> [90.027305] DMAR: DRHD: handling fault status reg 2
<3> [90.027363] DMAR: [DMA Write] Request device [00:02.0] fault addr 43000 [fault reason 07] Next page table ptr is invalid

In the GPU hang, it dies on MI_BATCH_BUFFER_END and doesn't return execution to the ring. I presume that's when the lookup failed.

DMA_FADDR: 0x00000000_00013870

didn't cross a page so what it was looking up that failed is a mystery.

"fault addr 43000" looks to be a reference to BBADDR: 0x00000000_00042fd4
But write? gem_exec_suspend does use a scratch page to verify HW works across suspend (using MI_STORE_DWORD_IMM), so probably that's the page that is absent.
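
To make the correlation above easier to check, here is a minimal sketch that pulls the fields out of a DMAR fault line and compares the fault address with the page following BBADDR. The regular expression only targets the line format quoted in this report, `parse_dmar_fault` and `page_of` are hypothetical helper names, and the 4 KiB page size is an assumption.

```python
import re

# Matches the DMAR fault line as quoted in this report (assumption: other
# kernel versions may word the message differently).
DMAR_FAULT_RE = re.compile(
    r"DMAR: \[DMA (?P<op>Read|Write)\] Request device \[(?P<dev>[0-9a-f:.]+)\] "
    r"fault addr (?P<addr>[0-9a-f]+) \[fault reason (?P<reason>\d+)\]"
)

PAGE_SIZE = 0x1000  # assumption: 4 KiB pages


def parse_dmar_fault(line):
    """Return the fault fields as a dict, or None if the line does not match."""
    m = DMAR_FAULT_RE.search(line)
    if m is None:
        return None
    return {
        "op": m.group("op"),
        "device": m.group("dev"),
        "addr": int(m.group("addr"), 16),  # the log prints hex without a 0x prefix
        "reason": int(m.group("reason")),
    }


def page_of(addr):
    """Round an address down to the start of its page."""
    return addr & ~(PAGE_SIZE - 1)


line = ("<3> [90.027363] DMAR: [DMA Write] Request device [00:02.0] "
        "fault addr 43000 [fault reason 07] Next page table ptr is invalid")
fault = parse_dmar_fault(line)
bbaddr = 0x42fd4  # BBADDR from the hang report

print(fault)
print("fault page:", hex(page_of(fault["addr"])))
print("page after BBADDR:", hex(page_of(bbaddr) + PAGE_SIZE))
```

With the values from this comment, the fault address falls on the page immediately after the one holding BBADDR.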
Comment 3 Chris Wilson 2019-09-16 12:16:25 UTC
It's worth pointing out the failure was before the suspend; so do we have a coherency issue with the dma-mapping?
Comment 4 CI Bug Log 2019-09-17 09:25:56 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@gem_exec_suspend@basic-s4-devices - fail - Failed assertion: !"GPU hung" -}
{+ ICL: igt@gem_exec_suspend@basic-s4-devices - fail - Failed assertion: !"GPU hung" +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6903/shard-iclb6/igt@gem_exec_suspend@basic.html
Comment 5 Lakshmi 2019-09-17 09:29:55 UTC
(In reply to Chris Wilson from comment #2)
> Correlates with
> 
> <3> [90.027305] DMAR: DRHD: handling fault status reg 2
> <3> [90.027363] DMAR: [DMA Write] Request device [00:02.0] fault addr 43000
> [fault reason 07] Next page table ptr is invalid
> 
> In the GPU hang, it dies on MI_BATCH_BUFFER_END and doesn't return execution
> to the ring. I presume that's when the lookup failed.
> 
> DMA_FADDR: 0x00000000_00013870
> 
> didn't cross a page so what it was looking up that failed is a mystery.
> 
> "fault addr 43000" looks to be a reference to BBADDR: 0x00000000_00042fd4
> But write? gem_exec_suspend does use a scratch page to verify HW works
> across suspend (using MI_STORE_DWORD_IMM), so probably that's the page that
> is absent.

(In reply to CI Bug Log from comment #4)
> A CI Bug Log filter associated to this bug has been updated:
> 
> {- ICL: igt@gem_exec_suspend@basic-s4-devices - fail - Failed assertion:
> !"GPU hung" -}
> {+ ICL: igt@gem_exec_suspend@basic-s4-devices - fail - Failed assertion:
> !"GPU hung" +}
> 
> New failures caught by the filter:
> 
>   *
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6903/shard-iclb6/
> igt@gem_exec_suspend@basic.html

One more instance, but on shards.
<3> [209.832569] DMAR: DRHD: handling fault status reg 2
<3> [209.832629] DMAR: [DMA Write] Request device [00:02.0] fault addr 41000 [fault reason 07] Next page table ptr is invalid

DMA_FADDR: 0x00000000_007f8b78
Comment 6 CI Bug Log 2019-09-17 09:30:22 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@gem_exec_suspend@basic-s4-devices - fail - Failed assertion: !"GPU hung" -}
{+ ICL: igt@gem_exec_suspend@basic-s4-devices|igt@gem_exec_suspend@basic - fail - Failed assertion: !"GPU hung" +}


  No new failures caught with the new filter
Comment 7 Chris Wilson 2019-09-17 10:02:41 UTC
<3> [209.832569] DMAR: DRHD: handling fault status reg 2
<3> [209.832629] DMAR: [DMA Write] Request device [00:02.0] fault addr 41000 [fault reason 07] Next page table ptr is invalid
<7> [212.694638] [drm:edp_panel_vdd_off_sync [i915]] Turning [ENCODER:214:DDI A] VDD off
<7> [212.694912] [drm:edp_panel_vdd_off_sync [i915]] PP_STATUS: 0x80000008 PP_CONTROL: 0x00000067
<7> [212.694994] [drm:intel_power_well_disable [i915]] disabling DC off
<7> [212.695078] [drm:skl_enable_dc6 [i915]] Enabling DC6
<7> [212.695165] [drm:gen9_set_dc_state [i915]] Setting DC state from 00 to 02
<7> [217.942713] hangcheck bcs0
<7> [217.942719] hangcheck 	Awake? 2
<7> [217.942723] hangcheck 	Hangcheck: 6016 ms ago
<7> [217.942727] hangcheck 	Reset count: 0 (global 740)
<7> [217.942730] hangcheck 	Requests:
<7> [217.942743] hangcheck 		active  617:1e0*-  prio=3 @ 8110ms: gem_exec_suspen[2283]
<7> [217.942747] hangcheck 		ring->start:  0x007f6000
<7> [217.942750] hangcheck 		ring->head:   0x00002ae0
<7> [217.942754] hangcheck 		ring->tail:   0x00002b78
<7> [217.942757] hangcheck 		ring->emit:   0x000037c0
<7> [217.942760] hangcheck 		ring->space:  0x000032e0
<7> [217.942763] hangcheck 		ring->hwsp:   0xffffa180
<7> [217.942767] hangcheck [head 2b10, postfix 2b50, tail 2b80, batch 0x00000000_00040000]:
<7> [217.942789] hangcheck [0000] 13244002 00000204 00000000 00000000 02800000 00000000 10400002 ffffa180
<7> [217.942794] hangcheck [0020] 00000000 000001df 04000001 18800101 00040000 00000000 04000000 00000000
<7> [217.942798] hangcheck [0040] 13004002 ffffa184 00000000 000001e0 01000000 04000001 0e40c002 00000000
<7> [217.942802] hangcheck [0060] ffffd0c8 00000000 02800000 00000000
<7> [217.942812] hangcheck 	MMIO base:  0x00022000
<7> [217.942824] hangcheck 	RING_START: 0x007f6000
<7> [217.942830] hangcheck 	RING_HEAD:  0x00002b48
<7> [217.942837] hangcheck 	RING_TAIL:  0x00002b78
<7> [217.942846] hangcheck 	RING_CTL:   0x00003001
<7> [217.942856] hangcheck 	RING_MODE:  0x00000000
<7> [217.942863] hangcheck 	RING_IMR: 00000000
<7> [217.942882] hangcheck 	ACTHD:  0x00000000_00202b48
<7> [217.942895] hangcheck 	BBADDR: 0x00000000_00040fd4
<7> [217.942908] hangcheck 	DMA_FADDR: 0x00000000_007f8b78
<7> [217.942915] hangcheck 	IPEIR: 0x00000000
<7> [217.942921] hangcheck 	IPEHR: 0x05000000
<7> [217.942932] hangcheck 	Execlist status: 0x00001098 60000020, entries 12
<7> [217.942936] hangcheck 	Execlist CSB read 3, write 3, tasklet queued? no (enabled)
<7> [217.942943] hangcheck 		Active[0: ring:{start:007f6000, hwsp:ffffa180, seqno:000001df}, rq:  617:1e0*-  prio=3 @ 8110ms: gem_exec_suspen[2283]
<7> [217.942950] hangcheck 		E  617:1e0*-  prio=3 @ 8110ms: gem_exec_suspen[2283]
<7> [217.942953] hangcheck HWSP:
<7> [217.942958] hangcheck [0000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7> [217.942962] hangcheck *
<7> [217.942967] hangcheck [0040] 00000018 60000020 00000001 60000000 00000018 60000020 00000001 60000000
<7> [217.942970] hangcheck *
<7> [217.942974] hangcheck [00a0] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000003
<7> [217.942979] hangcheck [00c0] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7> [217.942982] hangcheck *
<7> [217.942989] hangcheck Idle? no

So the fault addr of 0x41000 matches the page after the batch (BBADDR: 0x40fd4). The write is puzzling. The BBADDR is close enough to the page boundary for the 128-byte prefetch to cross into the next page, but that should not be a write from the CS parser. And the next page should happily be a scratch page, or the store buffer.
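
The prefetch argument can be checked with a little arithmetic. This is only a sketch under stated assumptions: pages are 4 KiB, the prefetch window is taken to start exactly at BBADDR and span the 128 bytes mentioned above, and `prefetch_crosses_page` and `next_page` are hypothetical helper names.

```python
PAGE_SIZE = 0x1000      # assumption: 4 KiB pages
PREFETCH_BYTES = 128    # prefetch size mentioned in this comment


def prefetch_crosses_page(bbaddr, prefetch=PREFETCH_BYTES, page_size=PAGE_SIZE):
    """True if [bbaddr, bbaddr + prefetch) touches more than one page."""
    first_page = bbaddr // page_size
    last_page = (bbaddr + prefetch - 1) // page_size
    return first_page != last_page


def next_page(addr, page_size=PAGE_SIZE):
    """Start address of the page following the one that contains addr."""
    return (addr // page_size + 1) * page_size


# (BBADDR, DMAR fault addr) pairs reported in comments 2 and 7.
samples = [(0x42fd4, 0x43000), (0x40fd4, 0x41000)]

for bbaddr, fault_addr in samples:
    print(
        f"BBADDR {bbaddr:#x}: crosses page = {prefetch_crosses_page(bbaddr)}, "
        f"next page = {next_page(bbaddr):#x}, fault addr = {fault_addr:#x}"
    )
```

In both reported hangs BBADDR sits 0x2c bytes before a page boundary, so under these assumptions a 128-byte window reaches into the page that shows up as the fault address.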
Comment 8 CI Bug Log 2019-10-04 07:33:10 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@gem_exec_suspend@basic-s4-devices|igt@gem_exec_suspend@basic - fail - Failed assertion: !"GPU hung" -}
{+ ICL: igt@gem_exec_suspend@basic-s4-devices|igt@gem_exec_suspend@basic - fail - Failed assertion: !"GPU hung" +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6996/shard-iclb2/igt@gem_exec_suspend@basic-s3.html
Comment 9 Francesco Balestrieri 2019-10-04 11:13:34 UTC
Happens with some frequency, setting to high/major.
Comment 10 Chris Wilson 2019-10-04 19:26:30 UTC
Worth noting that we've only seen "DMAR write fault 7" on ICL (afaict), so it is possibly HW specific?

