Bug 112157

Summary:	[CI][SHARDS]igt@kms_frontbuffer_tracking@psr-modesetfrombusy - dmesg-warn - GEM_BUG_ON(((tail) & ~((__typeof__(tail))((64)-1))) == ((ring->head) & ~((__typeof__(ring->head))((64)-1))) && tail < ring->head)
Product:	DRI	Reporter:	Lakshmi <lakshminarayana.vudum>
Component:	DRM/Intel	Assignee:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Status:	RESOLVED FIXED	QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity:	not set
Priority:	highest	CC:	intel-gfx-bugs
Version:	DRI git
Hardware:	Other
OS:	All
Whiteboard:
i915 platform:	SKL	i915 features:	GEM/Other

Description Lakshmi 2019-10-28 16:38:12 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7181/shard-skl5/igt@kms_frontbuffer_tracking@psr-modesetfrombusy.html
<3> [3763.145559] assert_ring_tail_valid:101 GEM_BUG_ON(((tail) & ~((__typeof__(tail))((64)-1))) == ((ring->head) & ~((__typeof__(ring->head))((64)-1))) && tail < ring->head)
<4> [3763.147091] ------------[ cut here ]------------
<2> [3763.147119] kernel BUG at drivers/gpu/drm/i915/gt/intel_ring.h:101!
<4> [3763.147183] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<4> [3763.147207] CPU: 0 PID: 6422 Comm: kworker/0:5 Tainted: G     U            5.4.0-rc4-CI-CI_DRM_7181+ #1
<4> [3763.147220] Hardware name: Google Caroline/Caroline, BIOS MrChromebox 08/27/2018
<4> [3763.147245] Workqueue: events delayed_fput
<4> [3763.147593] RIP: 0010:assert_ring_tail_valid.part.38+0x45/0x50 [i915]
<7> [3763.147930] [drm:intel_atomic_check [i915]] [CONNECTOR:95:eDP-1] Limiting display bpp to 24 instead of EDID bpp 24, requested bpp 36, max platform bpp 36
<4> [3763.147962] Code: aa 93 f1 e0 48 8b 35 32 cc 25 00 49 c7 c0 58 5a 39 a0 b9 65 00 00 00 48 c7 c2 d0 ea 36 a0 48 c7 c7 50 1c 21 a0 e8 7b 8f f8 e0 <0f> 0b 66 0f 1f 84 00 00 00 00 00 8b 47 2c f7 d8 83 c8 07 85 f0 74
<4> [3763.147980] RSP: 0018:ffffc90000003e00 EFLAGS: 00010086
<4> [3763.148002] RAX: 0000000000000025 RBX: ffff8881444c79c0 RCX: 0000000000000000
<4> [3763.148018] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffff88817a097000
<4> [3763.148034] RBP: ffffc90010eab0e0 R08: 00000000003c2541 R09: ffff888161efb000
<4> [3763.148046] R10: 00000000fc0219bd R11: ffff88817a097000 R12: ffffc90010eab108
<4> [3763.148055] R13: 0000000000000001 R14: ffff8881444c0040 R15: ffff88816cc92000
<4> [3763.148067] FS:  0000000000000000(0000) GS:ffff88817aa00000(0000) knlGS:0000000000000000
<4> [3763.148077] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [3763.148087] CR2: 00007f8c17dc3470 CR3: 0000000134c60005 CR4: 00000000003606f0
<4> [3763.148093] Call Trace:
<4> [3763.148105]  <IRQ>
<4> [3763.148440]  assert_ring_tail_valid+0x6a/0x70 [i915]
<7> [3763.148781] [drm:intel_dp_compute_config [i915]] DP link computation with max lane count 4 max rate 270000 max bpp 24 pixel clock 252750KHz
<4> [3763.149134]  gen8_emit_fini_breadcrumb+0xa3/0x1b0 [i915]
<4> [3763.149549]  __i915_request_submit+0x29b/0x430 [i915]
<4> [3763.149911]  __execlists_submission_tasklet+0x6a6/0x16a0 [i915]
<4> [3763.150260]  execlists_submission_tasklet+0xf2/0x110 [i915]
<4> [3763.150286]  tasklet_action_common.isra.18+0x6c/0x1c0
<4> [3763.150311]  __do_softirq+0xdf/0x47f
<4> [3763.150339]  irq_exit+0xba/0xc0
<7> [3763.150678] [drm:intel_dp_compute_config [i915]] Force DSC en = 0
<4> [3763.150710]  do_IRQ+0x83/0x160
<4> [3763.150748]  common_interrupt+0xf/0xf
<4> [3763.150765]  </IRQ>
<4> [3763.150791] RIP: 0010:_raw_spin_unlock_irqrestore+0x4e/0x60
<4> [3763.150810] Code: c7 02 75 1f 53 9d e8 91 86 7b ff bf 01 00 00 00 e8 57 f0 6f ff 65 8b 05 28 2e 63 7e 85 c0 74 0c 5b 5d c3 e8 74 84 7b ff 53 9d <eb> df e8 7b eb 61 ff 5b 5d c3 0f 1f 84 00 00 00 00 00 53 48 89 fb
<4> [3763.150826] RSP: 0018:ffffc900009afb70 EFLAGS: 00000202 ORIG_RAX: ffffffffffffffdb
<4> [3763.150847] RAX: ffff88812a50a840 RBX: 0000000000000202 RCX: 0000000000000000
<4> [3763.150862] RDX: 00000000000005d2 RSI: ffff88812a50b0f8 RDI: ffffffff8213be21
<4> [3763.150878] RBP: ffff88810ea70890 R08: ffff88812a50b0f8 R09: 0000000000000000
<4> [3763.150893] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88810ea70890
<4> [3763.150908] R13: 0000000000000202 R14: ffffea000572a740 R15: 000000000000000b
<4> [3763.150965]  delete_from_page_cache+0x4b/0x70
<4> [3763.150995]  truncate_inode_page+0x1d/0x30
<4> [3763.151017]  shmem_undo_range+0x1d5/0x890
<4> [3763.151112]  shmem_truncate_range+0x11/0x30
<4> [3763.151135]  shmem_evict_inode+0xe1/0x250
<4> [3763.151158]  ? evict+0xb5/0x190
<4> [3763.151191]  evict+0xcb/0x190
<4> [3763.151213]  __dentry_kill+0xd7/0x180
<4> [3763.151237]  __fput+0x113/0x250
<4> [3763.151275]  delayed_fput+0x17/0x30
<4> [3763.151305]  process_one_work+0x26a/0x620
<7> [3763.151659] [drm:intel_dp_compute_config [i915]] DP lane count 4 clock 270000 bpp 24
<4> [3763.151708]  worker_thread+0x37/0x380
<4> [3763.151736]  ? process_one_work+0x620/0x620
<4> [3763.151767]  kthread+0x119/0x130
<4> [3763.151791]  ? kthread_park+0x80/0x80
<4> [3763.151826]  ret_from_fork+0x3a/0x50
<4> [3763.151876] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_intel snd_intel_nhlt snd_hda_codec x86_pkg_temp_thermal i915 coretemp snd_hwdep crct10dif_pclmul crc32_pclmul snd_hda_core ghash_clmulni_intel btusb btrtl snd_pcm btbcm btintel bluetooth cdc_ether usbnet r8152 mii ecdh_generic ecc i2c_hid pinctrl_sunrisepoint pinctrl_intel prime_numbers
<0> [3763.151977] Dumping ftrace buffer:
<7> [3763.152386] [drm:intel_dp_compute_config [i915]] DP link rate required 758250 available 1080000
<0> [3763.152390] ---------------------------------

Comment 1 CI Bug Log 2019-10-28 16:39:52 UTC

The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* SKL:  igt@kms_frontbuffer_tracking@psr-modesetfrombusy - dmesg-warn -  GEM_BUG_ON(((tail) &amp; ~((__typeof__(tail))((64)-1))) == ((ring-&gt;head) &amp; ~((__typeof__(ring-&gt;head))((64)-1))) &amp;&amp; tail &lt; ring-&gt;head)
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7181/shard-skl5/igt@kms_frontbuffer_tracking@psr-modesetfrombusy.html

Comment 2 Chris Wilson 2019-10-29 11:41:18 UTC

At least this has happened more than once, so we may be in luck and be able to add some more debug.

Comment 3 Chris Wilson 2019-10-29 11:49:01 UTC

Same run:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7181/shard-apl6/pstore15-1572035145_Panic_2.log
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7181/shard-glk2/pstore22-1572035113_Panic_2.log
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_7181/shard-kbl3/pstore20-1572041883_Panic_2.log

Comment 4 Chris Wilson 2019-10-29 11:54:23 UTC

<0>[ 2016.271448]   <idle>-0       0..s1 2012175800us : process_csb: rcs0 cs-irq head=0, tail=1
<0>[ 2016.271485]   <idle>-0       0..s1 2012175800us : process_csb: rcs0 csb[1]: status=0x00000882:0x00000060
<0>[ 2016.271524]   <idle>-0       0..s1 2012175801us : trace_ports: rcs0: preempted { 3d30:32, 0:0 }
<0>[ 2016.271561] gem_exec-8517    1d..1 2012175801us : __execlists_submission_tasklet: vcs1: queue_priority_hint:-2147483648, submit:yes
<0>[ 2016.271601]   <idle>-0       0..s1 2012175801us : process_csb: reset_active(rcs0): { rq=3d30:32 }
<0>[ 2016.271639] gem_exec-8517    1d..1 2012175802us : trace_ports: vcs1: submit { a:428, 0:0 }
<0>[ 2016.271683] gem_exec-8517    1.... 2012175811us : __i915_request_commit: vecs0 fence c:608
<0>[ 2016.271721]   <idle>-0       0..s1 2012175811us : trace_ports: rcs0: promote { 4:9442!, 0:0 }
<0>[ 2016.271765]   <idle>-0       0d.s2 2012175813us : __i915_request_submit: rcs0 fence 3d30:12, current 10
<0>[ 2016.271810]   <idle>-0       0d.s2 2012175820us : __i915_request_submit: rcs0 fence 3d30:14, current 10
<0>[ 2016.271855]   <idle>-0       0d.s2 2012175825us : __i915_request_submit: rcs0 fence 3d30:16, current 10
<0>[ 2016.271900] gem_exec-8517    1d..1 2012175825us : __i915_request_submit: vecs0 fence c:608, current 606
<0>[ 2016.271944]   <idle>-0       0d.s2 2012175829us : __i915_request_submit: rcs0 fence 3d30:18, current 10
<0>[ 2016.271985] gem_exec-8517    1d..1 2012175831us : __execlists_submission_tasklet: vecs0: queue_priority_hint:-2147483648, submit:yes
<0>[ 2016.272027] gem_exec-8517    1d..1 2012175832us : trace_ports: vecs0: submit { c:608, 0:0 }
<0>[ 2016.272070]   <idle>-0       0d.s2 2012175833us : __i915_request_submit: rcs0 fence 3d30:20, current 10
<0>[ 2016.272115]   <idle>-0       0d.s2 2012175837us : __i915_request_submit: rcs0 fence 3d30:22, current 10
<0>[ 2016.272159]   <idle>-0       0d.s2 2012175840us : __i915_request_submit: rcs0 fence 3d30:24, current 10
<0>[ 2016.272203]   <idle>-0       0d.s2 2012175843us : __i915_request_submit: rcs0 fence 3d30:26, current 10
<0>[ 2016.272247]   <idle>-0       0d.s2 2012175846us : __i915_request_submit: rcs0 fence 3d30:28, current 10
<0>[ 2016.272292]   <idle>-0       0d.s2 2012175850us : __i915_request_submit: rcs0 fence 3d30:30, current 10
<0>[ 2016.272331] gem_exec-8517    1.... 2012176048us : __intel_context_do_pin: rcs0 context:3d2e pin ring:{head:0000, tail:0000}
<0>[ 2016.272370] gem_exec-8517    1.... 2012176053us : intel_context_unpin: rcs0 context:3d2e retire
<0>[ 2016.272409]   <idle>-0       0d.s2 2012176132us : assert_ring_tail_valid.part.38: assert_ring_tail_valid:101 GEM_BUG_ON(((tail) & ~((__typeof__(tail))((64)-1))) == ((ring->head) & ~((__typeof__(ring->head))((64)-1))) && tail < ring->head)

So it's the rogue i915.hangcheck=0 leading to context cancellation.

I believe fixed by

commit a7f328fc789817a6a0e5c46411956810d5ee00ca
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Oct 28 12:41:25 2019 +0000

    drm/i915/execlists: Simply walk back along request timeline on reset
    
    The request's timeline will only contain requests from this context, in
    order of execution. Therefore, we can simply look back along this
    timeline to find the currently executing request.
    
    If we do find that the current context has completed its last request,
    that does not imply that all requests are completed in the context, so
    only advance the ring->head up to the end of the known completions!
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20191028124125.25176-1-chris@chris-wilson.co.uk

The dangerous part is that this is precipitated by hangcheck=0 and so will not appear again in CI... That suggests I need a better smoketest for persistence opt-out.

Comment 5 CI Bug Log 2019-10-29 16:49:01 UTC

A CI Bug Log filter associated to this bug has been updated:

{- SKL:  igt@kms_frontbuffer_tracking@psr-modesetfrombusy - dmesg-warn -  GEM_BUG_ON(((tail) &amp; ~((__typeof__(tail))((64)-1))) == ((ring-&gt;head) &amp; ~((__typeof__(ring-&gt;head))((64)-1))) &amp;&amp; tail &lt; ring-&gt;head) -}
{+ SKL:  igt@kms_frontbuffer_tracking@psr-modesetfrombusy - dmesg-warn -  GEM_BUG_ON(((tail) &amp; ~((__typeof__(tail))((64)-1))) == ((ring-&gt;head) &amp; ~((__typeof__(ring-&gt;head))((64)-1))) &amp;&amp; tail &lt; ring-&gt;head) +}


  No new failures caught with the new filter

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.