Bug 111895

Summary: [CI][BAT] igt@i915_selftest@live_execlists/live_unlite_switch - dmesg-fail: timed out, cancelling all further testing
Product: DRI
Component: DRM/Intel
Version: XOrg git
Hardware: Other
OS: All
Status: RESOLVED FIXED
Severity: normal
Priority: not set
Reporter: Imre Deak <imre.deak>
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
CC: chris, intel-gfx-bugs
Whiteboard:
i915 platform: BSW/CHT
i915 features:

Description Imre Deak 2019-10-03 12:16:14 UTC
https://bugs.freedesktop.org/show_bug.cgi?id=111108

<7>[  538.919490] [IGT] i915_selftest: starting subtest live_execlists
...
<6>[  539.473261] i915: Performing live selftests with st_random_seed=0xe061e48a st_timeout=1000
...
<6>[  539.544358] i915: Running intel_execlists_live_selftests/live_unlite_switch
...
<3>[  539.772486] igt_live_test_end+0x1f/0xf0 [i915] timed out, cancelling all further testing.
Comment 1 Chris Wilson 2019-10-03 12:19:16 UTC
(In reply to Imre Deak from comment #0)
> https://bugs.freedesktop.org/show_bug.cgi?id=111108

Absolutely nothing to do with that.
Comment 2 Imre Deak 2019-10-03 12:22:51 UTC
(In reply to Chris Wilson from comment #1)
> (In reply to Imre Deak from comment #0)
> > https://bugs.freedesktop.org/show_bug.cgi?id=111108
> 
> Absolutely nothing to do with that.

Argh, I copied the wrong URL. The correct one is:
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_14640/fi-bsw-n3050/igt@i915_selftest@live_execlists.html
Comment 3 CI Bug Log 2019-10-03 12:37:46 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* BSW: igt@i915_selftest@live_execlists - dmesg-fail - igt_live_test_end+0x1f/0xf0 [i915] timed out
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6994/fi-bsw-kefka/igt@i915_selftest@live_execlists.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_14640/fi-bsw-n3050/igt@i915_selftest@live_execlists.html
Comment 4 Chris Wilson 2019-10-03 13:13:31 UTC
<0> [561.060235] i915_sel-4367    0.... 539658733us : __i915_subtests: Running intel_execlists_live_selftests/live_unlite_switch
<0> [561.060594] i915_sel-4367    0.... 539662660us : __intel_context_do_pin: rcs0 context:efc0 pin ring:{head:0000, tail:0000}
<0> [561.060953] i915_sel-4367    0.... 539664389us : __intel_context_do_pin: rcs0 context:efc1 pin ring:{head:0000, tail:0000}
<0> [561.061312] i915_sel-4367    0.... 539665442us : __engine_unpark: rcs0
<0> [561.061671] i915_sel-4367    0.... 539665496us : __gt_unpark: 
<0> [561.062057] i915_sel-4367    0.... 539669438us : __i915_request_commit: rcs0 fence efc0:2
<0> [561.062444] i915_sel-4367    0d..1 539669467us : __i915_request_submit: rcs0 fence efc0:2, current 0
<0> [561.062810] i915_sel-4367    0d..1 539669474us : __execlists_submission_tasklet: rcs0: queue_priority_hint:-2147483648, submit:yes
<0> [561.063177] i915_sel-4367    0d..1 539669479us : trace_ports: rcs0: submit { efc0:2, 0:0 }
<0> [561.063544] jbd2/sda-215     1..s. 539669531us : process_csb: rcs0 cs-irq head=3, tail=4
<0> [561.063910] jbd2/sda-215     1..s. 539669535us : process_csb: rcs0 csb[4]: status=0x00000001:0x00000000
<0> [561.064279] jbd2/sda-215     1..s. 539669539us : trace_ports: rcs0: promote { efc0:2*, 0:0 }
<0> [561.064646] jbd2/sda-215     1d.s1 539669547us : __execlists_submission_tasklet: rcs0: queue_priority_hint:-2147483648, submit:no
<0> [561.065032] i915_sel-4367    0.... 539669654us : __i915_request_commit: rcs0 fence efc1:2
<0> [561.065419] i915_sel-4367    0d..1 539669692us : __i915_request_submit: rcs0 fence efc1:2, current 0
<0> [561.065787] i915_sel-4367    0d..1 539669697us : __execlists_submission_tasklet: rcs0: queue_priority_hint:-2147483648, submit:yes
<0> [561.066155] i915_sel-4367    0d..1 539669701us : trace_ports: rcs0: submit { efc0:2*, efc1:2 }
<0> [561.066542] i915_sel-4367    0.... 539669796us : __i915_request_commit: rcs0 fence efc0:4
<0> [561.066908] ksoftirq-16      1..s. 539669909us : process_csb: rcs0 cs-irq head=4, tail=5
<0> [561.067274] ksoftirq-16      1..s. 539669911us : process_csb: rcs0 csb[5]: status=0x00008002:0x00000001
<0> [561.067642] ksoftirq-16      1..s. 539669914us : trace_ports: rcs0: preempted { efc0:2*, 0:0 }
<0> [561.068009] ksoftirq-16      1..s. 539669916us : trace_ports: rcs0: promote { efc0:2*, efc1:2 }
<0> [561.068377] ksoftirq-16      1d.s1 539669948us : __execlists_submission_tasklet: rcs0: expired last=efc0:2, prio=3, hint=3
<0> [561.068764] ksoftirq-16      1d.s1 539669952us : __i915_request_unsubmit: rcs0 fence efc1:2, current 0
<0> [561.069151] ksoftirq-16      1d.s1 539670042us : __i915_request_unsubmit: rcs0 fence efc0:2, current 1
<0> [561.069537] ksoftirq-16      1d.s1 539670050us : __i915_request_submit: rcs0 fence efc0:2, current 1
<0> [561.069924] ksoftirq-16      1d.s1 539670056us : __i915_request_submit: rcs0 fence efc1:2, current 0
<0> [561.070290] ksoftirq-16      1d.s1 539670060us : __execlists_submission_tasklet: rcs0: queue_priority_hint:2, submit:yes
<0> [561.070657] ksoftirq-16      1d.s1 539670062us : trace_ports: rcs0: submit { efc0:2*, efc1:2 }
<0> [561.071043] i915_sel-4367    0.... 539670069us : i915_gem_wait_for_idle: flags=2 (locked), timeout=200
<0> [561.071408] ksoftirq-16      1..s. 539670171us : process_csb: rcs0 cs-irq head=5, tail=0
<0> [561.071775] ksoftirq-16      1..s. 539670172us : process_csb: rcs0 csb[0]: status=0x00008002:0x00000001
<0> [561.072142] ksoftirq-16      1..s. 539670174us : trace_ports: rcs0: preempted { efc0:2!, efc1:2 }
<0> [561.072509] ksoftirq-16      1..s. 539670177us : trace_ports: rcs0: promote { efc0:2!, efc1:2 }
<0> [561.072896] ksoftirq-16      1d.s1 539670184us : __i915_request_submit: rcs0 fence efc0:4, current 2
<0> [561.073263] ksoftirq-16      1d.s1 539670188us : __execlists_submission_tasklet: rcs0: queue_priority_hint:-2147483648, submit:yes
<0> [561.073630] ksoftirq-16      1d.s1 539670190us : trace_ports: rcs0: submit { efc1:2, efc0:4 }
<0> [561.073996] ksoftirq-16      1..s. 539670222us : process_csb: rcs0 cs-irq head=0, tail=1
<0> [561.074363] ksoftirq-16      1..s. 539670224us : process_csb: rcs0 csb[1]: status=0x00000814:0x00000001
<0> [561.074730] ksoftirq-16      1..s. 539670325us : trace_ports: rcs0: completed { efc0:2!, efc1:2 }
<0> [561.075095] ksoftirq-16      1..s. 539670329us : process_csb: rcs0 cs-irq head=1, tail=4
<0> [561.075462] ksoftirq-16      1..s. 539670330us : process_csb: rcs0 csb[2]: status=0x00008002:0x00000001
<0> [561.075829] ksoftirq-16      1..s. 539670331us : trace_ports: rcs0: preempted { efc1:2!, 0:0 }
<0> [561.076197] ksoftirq-16      1..s. 539670333us : trace_ports: rcs0: promote { efc1:2!, efc0:4 }
<0> [561.076564] ksoftirq-16      1..s. 539670334us : process_csb: rcs0 csb[3]: status=0x00000814:0x00000001
<0> [561.076931] ksoftirq-16      1..s. 539670335us : trace_ports: rcs0: completed { efc1:2!, efc0:4 }
<0> [561.077298] ksoftirq-16      1..s. 539670338us : process_csb: rcs0 csb[4]: status=0x00000818:0x00000001
<0> [561.077665] ksoftirq-16      1..s. 539670340us : trace_ports: rcs0: completed { efc0:4!, 0:0 }
<0> [561.078032] ksoftirq-16      1d.s1 539670342us : __execlists_submission_tasklet: rcs0: queue_priority_hint:-2147483648, submit:no
<0> [561.078418] i915_sel-4367    0.... 539670348us : i915_request_retire: rcs0 fence efc1:2, current 2
<0> [561.078776] i915_sel-4367    0.... 539670357us : intel_context_unpin: rcs0 context:efc1 retire
<0> [561.079135] i915_sel-4367    0.... 539670418us : __intel_context_retire: rcs0 context:efc1 retire
<0> [561.079522] i915_sel-4367    0.... 539671236us : i915_request_retire: rcs0 fence efc0:2, current 4
<0> [561.079909] i915_sel-4367    0.... 539671302us : i915_request_retire: rcs0 fence efc0:4, current 4
<0> [561.080295] i915_sel-4367    0d... 539671354us : __i915_request_commit: rcs0 fence efb0:8
<0> [561.080681] i915_sel-4367    0d..1 539671396us : __i915_request_submit: rcs0 fence efb0:8, current 6
<0> [561.081049] i915_sel-4367    0d..1 539671421us : __execlists_submission_tasklet: rcs0: queue_priority_hint:-2147483648, submit:yes
<0> [561.081416] i915_sel-4367    0d..1 539671427us : trace_ports: rcs0: submit { efb0:8, 0:0 }
<0> [561.081774] i915_sel-4367    0.... 539671438us : intel_context_unpin: rcs0 context:efc0 retire
<0> [561.082132] i915_sel-4367    0.... 539671498us : __intel_context_retire: rcs0 context:efc0 retire
<0> [561.082497] kworker/-155     1..s. 539671601us : process_csb: rcs0 cs-irq head=4, tail=5
<0> [561.082864] kworker/-155     1..s. 539671603us : process_csb: rcs0 csb[5]: status=0x00000001:0x00000000
<0> [561.083231] kworker/-155     1..s. 539671606us : trace_ports: rcs0: promote { efb0:8!, 0:0 }
<0> [561.083597] kworker/-155     1d.s1 539671611us : __execlists_submission_tasklet: rcs0: queue_priority_hint:-2147483648, submit:no
<0> [561.083963] kworker/-155     1..s. 539671613us : process_csb: rcs0 cs-irq head=5, tail=0
<0> [561.084330] kworker/-155     1..s. 539671614us : process_csb: rcs0 csb[0]: status=0x00000818:0x00000000
<0> [561.084697] kworker/-155     1..s. 539671616us : trace_ports: rcs0: completed { efb0:8!, 0:0 }
<0> [561.085064] kworker/-155     1d.s1 539671620us : __execlists_submission_tasklet: rcs0: queue_priority_hint:-2147483648, submit:no
<0> [561.085450] i915_sel-4367    0.N.. 539672266us : i915_request_retire: rcs0 fence efb0:8, current 8
<0> [561.085809] i915_sel-4367    0.... 539672832us : __engine_park: rcs0
<0> [561.086170] kworker/-2966    0.... 539672877us : __gt_park: 
<0> [561.086529] i915_sel-4367    0.... 539676058us : __intel_context_do_pin: bcs0 context:efc2 pin ring:{head:0000, tail:0000}
<0> [561.086889] i915_sel-4367    0.... 539677466us : __intel_context_do_pin: bcs0 context:efc3 pin ring:{head:0000, tail:0000}
<0> [561.087247] i915_sel-4367    0.... 539677497us : __engine_unpark: bcs0
<0> [561.087607] i915_sel-4367    0.... 539677511us : __gt_unpark: 
<0> [561.087993] i915_sel-4367    0.... 539679446us : __i915_request_commit: bcs0 fence efc2:2
<0> [561.088380] i915_sel-4367    0d..1 539679473us : __i915_request_submit: bcs0 fence efc2:2, current 0
<0> [561.088747] i915_sel-4367    0d..1 539679480us : __execlists_submission_tasklet: bcs0: queue_priority_hint:-2147483648, submit:yes
<0> [561.089115] i915_sel-4367    0d..1 539679485us : trace_ports: bcs0: submit { efc2:2, 0:0 }
<0> [561.089481] jbd2/sda-215     1.Ns1 539679542us : process_csb: bcs0 cs-irq head=3, tail=4
<0> [561.089848] jbd2/sda-215     1.Ns1 539679544us : process_csb: bcs0 csb[4]: status=0x00000001:0x00000000
<0> [561.090215] jbd2/sda-215     1.Ns1 539679548us : trace_ports: bcs0: promote { efc2:2*, 0:0 }
<0> [561.090581] jbd2/sda-215     1dNs2 539679555us : __execlists_submission_tasklet: bcs0: queue_priority_hint:-2147483648, submit:no
<0> [561.090967] i915_sel-4367    0.... 539679561us : __i915_request_commit: bcs0 fence efc3:2
<0> [561.091355] i915_sel-4367    0d..1 539679585us : __i915_request_submit: bcs0 fence efc3:2, current 0
<0> [561.091721] i915_sel-4367    0d..1 539679590us : __execlists_submission_tasklet: bcs0: queue_priority_hint:-2147483648, submit:yes
<0> [561.092088] i915_sel-4367    0d..1 539679593us : trace_ports: bcs0: submit { efc2:2*, efc3:2 }
<0> [561.092474] i915_sel-4367    0.... 539679674us : __i915_request_commit: bcs0 fence efc2:4
<0> [561.092839] ksoftirq-16      1..s. 539679717us : process_csb: bcs0 cs-irq head=4, tail=5
<0> [561.093205] ksoftirq-16      1..s. 539679718us : process_csb: bcs0 csb[5]: status=0x00008802:0x00000001
<0> [561.093573] ksoftirq-16      1..s. 539679721us : trace_ports: bcs0: preempted { efc2:2*, 0:0 }
<0> [561.093940] ksoftirq-16      1..s. 539679724us : trace_ports: bcs0: promote { efc2:2*, efc3:2 }
<0> [561.094311] ksoftirq-16      1d.s1 539679731us : __execlists_submission_tasklet: bcs0: expired last=efc2:2, prio=3, hint=3
<0> [561.094698] ksoftirq-16      1d.s1 539679734us : __i915_request_unsubmit: bcs0 fence efc3:2, current 0
<0> [561.095086] ksoftirq-16      1d.s1 539679739us : __i915_request_unsubmit: bcs0 fence efc2:2, current 1
<0> [561.095473] ksoftirq-16      1d.s1 539679746us : __i915_request_submit: bcs0 fence efc2:2, current 1
<0> [561.095860] ksoftirq-16      1d.s1 539679752us : __i915_request_submit: bcs0 fence efc3:2, current 0
<0> [561.096227] ksoftirq-16      1d.s1 539679756us : __execlists_submission_tasklet: bcs0: queue_priority_hint:2, submit:yes
<0> [561.096595] ksoftirq-16      1d.s1 539679758us : trace_ports: bcs0: submit { efc2:2*, efc3:2 }
<0> [561.096977] i915_sel-4367    0.... 539679795us : i915_gem_wait_for_idle: flags=2 (locked), timeout=200
<0> [561.097678] i915_sel-4367    0.... 539887173us : igt_flush_test: igt_live_test_end+0x1f/0xf0 [i915] timed out.

That's the entire sequence. Looks suspiciously like we succeeded in killing the HW as intended...

There is a redundant preemption we can fix up, though.
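
To make that "redundant preemption" concrete: at 539669948us the tasklet declares the timeslice expired and preempts, only to unsubmit and immediately resubmit the very same { efc0:2, efc1:2 } pair in the same order. A minimal sketch of the kind of check that could skip such a no-op cycle follows; the names are hypothetical and this is not the actual i915 code:

#include <stdbool.h>
#include <stdint.h>

/* One ELSP slot as it appears in the trace, e.g. "efc0:2". */
struct hypothetical_port {
	uint32_t context_tag;	/* efc0, efc1, ... */
	uint32_t seqno;		/* :2, :4, ... */
};

/*
 * Compare what is already executing against what we are about to submit.
 * If every port holds the same request in the same order, the preemption
 * would only reshuffle identical state and can be skipped.
 */
static bool elsp_submission_redundant(const struct hypothetical_port *active,
				      const struct hypothetical_port *pending,
				      int nports)
{
	for (int i = 0; i < nports; i++) {
		if (active[i].context_tag != pending[i].context_tag ||
		    active[i].seqno != pending[i].seqno)
			return false;	/* genuinely different workload */
	}

	return true;	/* same requests, same order: nothing to gain */
}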
Comment 5 Chris Wilson 2019-10-03 15:59:43 UTC
In another example,

<0> [629.616137] i915_sel-4301    0.... 618083258us : __engine_unpark: vcs0
<0> [629.616327] i915_sel-4301    0.... 618083271us : __gt_unpark: 
<0> [629.616530] i915_sel-4301    0.... 618084676us : __i915_request_commit: vcs0 fence e80d:2
<0> [629.616734] i915_sel-4301    0d..1 618084704us : __i915_request_submit: vcs0 fence e80d:2, current 0
<0> [629.616928] i915_sel-4301    0d..1 618084711us : __execlists_submission_tasklet: vcs0: queue_priority_hint:-2147483648, submit:yes
<0> [629.617121] i915_sel-4301    0d..1 618084716us : trace_ports: vcs0: submit { e80d:2, 0:0 }
<0> [629.617325] i915_sel-4301    0.... 618084786us : __i915_request_commit: vcs0 fence e80e:2
<0> [629.617518]    dmesg-922     1..s. 618084787us : process_csb: vcs0 cs-irq head=3, tail=4
<0> [629.617711]    dmesg-922     1..s. 618084792us : process_csb: vcs0 csb[4]: status=0x00000001:0x00000000
<0> [629.617905]    dmesg-922     1..s. 618084798us : trace_ports: vcs0: promote { e80d:2*, 0:0 }
<0> [629.618109]    dmesg-922     1d.s1 618084814us : __i915_request_submit: vcs0 fence e80e:2, current 0
<0> [629.618302]    dmesg-922     1d.s1 618084823us : __execlists_submission_tasklet: vcs0: queue_priority_hint:-2147483648, submit:yes
<0> [629.618496]    dmesg-922     1d.s1 618084829us : trace_ports: vcs0: submit { e80d:2*, e80e:2 }
<0> [629.618700] i915_sel-4301    0.... 618084861us : __i915_request_commit: vcs0 fence e80d:4
<0> [629.618892]    dmesg-922     1..s. 618084907us : process_csb: vcs0 cs-irq head=4, tail=5
<0> [629.619085]    dmesg-922     1..s. 618084908us : process_csb: vcs0 csb[5]: status=0x00008002:0x00000001
<0> [629.619279]    dmesg-922     1..s. 618084912us : trace_ports: vcs0: preempted { e80d:2*, 0:0 }
<0> [629.619472]    dmesg-922     1..s. 618084916us : trace_ports: vcs0: promote { e80d:2*, e80e:2 }
<0> [629.619666]    dmesg-922     1d.s1 618084926us : __execlists_submission_tasklet: vcs0: expired last=e80d:2, prio=3, hint=3
<0> [629.619869]    dmesg-922     1d.s1 618084930us : __i915_request_unsubmit: vcs0 fence e80e:2, current 0
<0> [629.620073]    dmesg-922     1d.s1 618084936us : __i915_request_unsubmit: vcs0 fence e80d:2, current 1
<0> [629.620277]    dmesg-922     1d.s1 618084950us : __i915_request_submit: vcs0 fence e80d:2, current 1
<0> [629.620481]    dmesg-922     1d.s1 618084955us : __i915_request_submit: vcs0 fence e80e:2, current 0
<0> [629.620674]    dmesg-922     1d.s1 618084958us : __execlists_submission_tasklet: vcs0: queue_priority_

<7> [629.633154] __intel_gt_set_wedged vcs0
<7> [629.633159] __intel_gt_set_wedged 	Awake? 2
<7> [629.633165] __intel_gt_set_wedged 	Hangcheck: 0 ms ago
<7> [629.633170] __intel_gt_set_wedged 	Reset count: 0 (global 0)
<7> [629.633175] __intel_gt_set_wedged 	Requests:
<7> [629.633183] __intel_gt_set_wedged 	MMIO base:  0x00012000
<7> [629.633201] __intel_gt_set_wedged 	RING_START: 0x00000000
<7> [629.633211] __intel_gt_set_wedged 	RING_HEAD:  0x00000000
<7> [629.633220] __intel_gt_set_wedged 	RING_TAIL:  0x00000000
<7> [629.633232] __intel_gt_set_wedged 	RING_CTL:   0x00000000
<7> [629.633245] __intel_gt_set_wedged 	RING_MODE:  0x00000200 [idle]
<7> [629.633254] __intel_gt_set_wedged 	RING_IMR: fffffeff
<7> [629.633270] __intel_gt_set_wedged 	ACTHD:  0x00000000_00000000
<7> [629.633286] __intel_gt_set_wedged 	BBADDR: 0x00000000_00000000
<7> [629.633303] __intel_gt_set_wedged 	DMA_FADDR: 0x00000000_00000000
<7> [629.633312] __intel_gt_set_wedged 	IPEIR: 0x00000000
<7> [629.633321] __intel_gt_set_wedged 	IPEHR: 0x00000000
<7> [629.633334] __intel_gt_set_wedged 	Execlist status: 0x00000301 00000000, entries 6
<7> [629.633339] __intel_gt_set_wedged 	Execlist CSB read 4, write 4, tasklet queued? no (enabled)
<7> [629.633421] __intel_gt_set_wedged 		E  e80d:2!  prio=3 @ 11565ms: [i915]
<7> [629.633469] __intel_gt_set_wedged 		E  e80e:2!+  prio=3 @ 11565ms: signaled
<7> [629.633477] __intel_gt_set_wedged 		E  e80d:4!  prio=2 @ 11565ms: [i915]
<7> [629.633482] __intel_gt_set_wedged HWSP:
<7> [629.633490] __intel_gt_set_wedged [0000] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7> [629.633494] __intel_gt_set_wedged *
<7> [629.633502] __intel_gt_set_wedged [0040] 00008002 00000001 00000814 00000001 00000818 00000001 00000001 00000000
<7> [629.633508] __intel_gt_set_wedged [0060] 00000818 00000001 00008002 00000001 00000000 00000000 00000000 00000004
<7> [629.633515] __intel_gt_set_wedged [0080] 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
<7> [629.633520] __intel_gt_set_wedged *

where it is much clearer that unlite-restore hit its mark and tricked the HW into doing something naughty.
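
For readers decoding the CSB events above: a lite-restore is the shortcut the command streamer takes when the incoming ELSP element carries the same logical context ID as the context currently executing, so it resumes at the new tail without a full context save/restore. The rough model below is only an aid for reading the trace and register dump; it is not real driver or hardware code, and the names are invented:

#include <stdint.h>

struct elsp_entry {
	uint32_t lrca;		/* logical ring context address */
	uint32_t ctx_id;	/* logical context ID (the "HW ID" / tag) */
};

enum switch_kind { FULL_CONTEXT_SWITCH, LITE_RESTORE };

/*
 * Model of the hardware's decision on ELSP submission: a matching logical
 * context ID is treated as "same context, new tail", so no state is saved
 * or reloaded.  live_unlite_switch deliberately makes two different
 * contexts look identical by this test, which is the "something naughty"
 * referred to above.
 */
static enum switch_kind context_switch_kind(const struct elsp_entry *running,
					    const struct elsp_entry *incoming)
{
	if (running && running->ctx_id == incoming->ctx_id)
		return LITE_RESTORE;

	return FULL_CONTEXT_SWITCH;
}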
Comment 6 Chris Wilson 2019-10-04 16:08:07 UTC
Hopefully it is the problem I was aiming for in the test...


commit 2935ed5339c495066d901b97de384d86c55a3f6d
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Oct 4 14:40:08 2019 +0100

    drm/i915: Remove logical HW ID
    
    With the introduction of ctx->engines[] we allow multiple logical
    contexts to be used on the same engine (e.g. with virtual engines).
    According to bspec, each logical context requires a unique tag in order
    for context-switching to occur correctly between them. [Simple
    experiments show that it is not so easy to trick the HW into performing
    a lite-restore with matching logical IDs, though my memory from early
    Broadwell experiments does suggest that it should be generating
    lite-restores.]
    
    We only need to keep a unique tag for the active lifetime of the
    context, and for as long as we need to identify that context. The HW
    uses the tag to determine if it should use a lite-restore (why not the
    LRCA?) and passes the tag back for various status identifiers. The only
    status we need to track is for OA, so when using perf, we assign the
    specific context a unique tag.
    
    v2: Calculate required number of tags to fill ELSP.
    
    Fixes: 976b55f0e1db ("drm/i915: Allow a context to define its set of engines")
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111895
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Acked-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20191004134015.13204-14-chris@chris-wilson.co.uk
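
As a reading aid for the scheme the commit message describes (a small per-engine tag pool sized to fill the ELSP, plus a pinned tag for OA/perf), here is a sketch under assumed names and sizes; the real implementation in i915 differs in detail:

#include <stdint.h>

#define NUM_CONTEXT_TAGS 4	/* enough distinct tags to fill every ELSP port */

struct engine_tag_pool {
	uint32_t bitmap;	/* bit n set => tag n is in use */
};

/* Borrow a tag when a context becomes active on the engine. */
static int context_tag_alloc(struct engine_tag_pool *pool)
{
	for (int tag = 0; tag < NUM_CONTEXT_TAGS; tag++) {
		if (!(pool->bitmap & (1u << tag))) {
			pool->bitmap |= 1u << tag;
			return tag;
		}
	}
	return -1;	/* cannot happen if the pool covers the ELSP depth */
}

/* Return the tag once the context is no longer active. */
static void context_tag_release(struct engine_tag_pool *pool, int tag)
{
	pool->bitmap &= ~(1u << tag);
}

A context opted into OA/perf would instead keep one fixed tag outside this pool for as long as perf needs to identify it in its reports.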
