111397 – [CI][BAT]: igt@i915_selftest@live_hangcheck - timeout - list_add corruption. next->prev should be prev (ffff888405f2c430), but was ffff88848d9492c0. (next=ffff88848d94b480).

Bug 111397 - [CI][BAT]: igt@i915_selftest@live_hangcheck - timeout - list_add corruption. next->prev should be prev (ffff888405f2c430), but was ffff88848d9492c0. (next=ffff88848d94b480).

Summary: [CI][BAT]: igt@i915_selftest@live_hangcheck - timeout - list_add corruption....

Status:	RESOLVED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel (show other bugs)
Version:	DRI git
Hardware:	Other All

Importance:	highest normal
Assignee:	Intel GFX Bugs mailing list
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2019-08-14 06:03 UTC by Lakshmi
Modified:	2019-08-16 09:18 UTC (History)
CC List:	1 user (show)

See Also:
i915 platform:	ICL
i915 features:

Attachments

Description Lakshmi 2019-08-14 06:03:33 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5130/fi-icl-u2/igt@i915_selftest@live_hangcheck.html
<4> [837.968347] ------------[ cut here ]------------
<4> [837.968356] list_add corruption. next->prev should be prev (ffff888405f2c430), but was ffff88848d9492c0. (next=ffff88848d94b480).
<4> [837.968367] WARNING: CPU: 0 PID: 2448 at lib/list_debug.c:25 __list_add_valid+0x36/0x70
<4> [837.968369] Modules linked in: i915(+) amdgpu gpu_sched ttm vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic mei_hdcp x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul snd_intel_nhlt snd_hda_codec snd_hwdep snd_hda_core ghash_clmulni_intel e1000e snd_pcm cdc_ether usbnet mii ptp pps_core mei_me mei prime_numbers btusb btrtl btbcm btintel bluetooth ecdh_generic ecc [last unloaded: i915]
<4> [837.968388] CPU: 0 PID: 2448 Comm: kworker/u16:44 Tainted: G     U            5.3.0-rc4-CI-CI_DRM_6691+ #1
<4> [837.968390] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP TLC, BIOS ICLSFWR1.R00.3183.A00.1905020411 05/02/2019
<4> [837.968434] Workqueue: i915 retire_work_handler [i915]
<4> [837.968437] RIP: 0010:__list_add_valid+0x36/0x70
<4> [837.968439] Code: 48 8b 10 4c 39 c2 75 27 48 39 f8 74 39 48 39 fa 74 34 b8 01 00 00 00 c3 48 89 d1 48 c7 c7 18 28 0e 82 48 89 c2 e8 aa 58 bc ff <0f> 0b 31 c0 c3 48 89 c1 4c 89 c6 48 c7 c7 68 28 0e 82 e8 93 58 bc
<4> [837.968441] RSP: 0018:ffffc90000b2fcd8 EFLAGS: 00010282
<4> [837.968443] RAX: 0000000000000000 RBX: ffff888405f20000 RCX: 0000000000000001
<4> [837.968444] RDX: 0000000080000001 RSI: ffff8883f0a3c990 RDI: 00000000ffffffff
<4> [837.968446] RBP: ffffc90000b2fd30 R08: ffff8883f0a3c990 R09: 0000000000000000
<4> [837.968447] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8883aac60008
<4> [837.968449] R13: ffff88833826dec0 R14: ffff88848d9481e0 R15: ffff88848d94b480
<4> [837.968450] FS:  0000000000000000(0000) GS:ffff88849fc00000(0000) knlGS:0000000000000000
<4> [837.968452] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [837.968453] CR2: 0000558311685f00 CR3: 0000000005210003 CR4: 0000000000760ef0
<4> [837.968455] PKRU: 55555554
<4> [837.968456] Call Trace:
<4> [837.968499]  __i915_request_commit+0x2dc/0x5d0 [i915]
<4> [837.968538]  ? __i915_request_create+0x228/0x4c0 [i915]
<4> [837.968574]  __engine_park+0x64/0x250 [i915]
<4> [837.968630]  ____intel_wakeref_put_last+0x1c/0x70 [i915]
<4> [837.968666]  i915_active_acquire_barrier+0x126/0x210 [i915]
<4> [837.968696]  intel_context_unpin+0xac/0xf0 [i915]
<4> [837.968732]  i915_request_retire+0x424/0x840 [i915]
<4> [837.968766]  ring_retire_requests+0x47/0x50 [i915]
<4> [837.968798]  i915_retire_requests+0x57/0xc0 [i915]
<4> [837.968831]  retire_work_handler+0x27/0x60 [i915]
<4> [837.968835]  process_one_work+0x245/0x610
<4> [837.968841]  worker_thread+0x37/0x380
<4> [837.968844]  ? process_one_work+0x610/0x610
<4> [837.968846]  kthread+0x119/0x130
<4> [837.968849]  ? kthread_park+0xa0/0xa0
<4> [837.968853]  ret_from_fork+0x24/0x50
<4> [837.968860] irq event stamp: 25769958
<4> [837.968863] hardirqs last  enabled at (25769957): [<ffffffff8112b537>] console_unlock+0x3f7/0x5a0
<4> [837.968865] hardirqs last disabled at (25769958): [<ffffffff810019ea>] trace_hardirqs_off_thunk+0x1a/0x20
<4> [837.968867] softirqs last  enabled at (25769938): [<ffffffff81c0033e>] __do_softirq+0x33e/0x4ae
<4> [837.968869] softirqs last disabled at (25769929): [<ffffffff810b8099>] irq_exit+0xa9/0xc0
<4> [837.968871] WARNING: CPU: 0 PID: 2448 at lib/list_debug.c:25 __list_add_valid+0x36/0x70
<4> [837.968873] ---[ end trace 472f2ecddc09acc7 ]---

Comment 1 CI Bug Log 2019-08-14 06:04:16 UTC

The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ICL: igt@i915_selftest@live_hangcheck - timeout -  list_add corruption. next-&gt;prev should be prev (ffff888405f2c430), but was ffff88848d9492c0. (next=ffff88848d94b480).
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5130/fi-icl-u2/igt@i915_selftest@live_hangcheck.html

Comment 2 Chris Wilson 2019-08-14 08:27:44 UTC

The corruption started a little earlier. It should be fixed by https://patchwork.freedesktop.org/series/65152/

Comment 3 Chris Wilson 2019-08-14 09:19:13 UTC

commit f130b712ff259f03feaf7a245e3464e3faad89ae (HEAD -> drm-intel-next-queued, drm-intel/for-linux-next, drm-intel/drm-intel-next-queued)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Tue Aug 13 21:09:05 2019 +0100

    drm/i915: Serialise read/write of the barrier's engine
    
    We use the request pointer inside the i915_active_node as the indicator
    of the barrier's status; we mark it as used during
    i915_request_add_active_barriers(), and search for an available barrier
    in reuse_idle_barrier(). That check must be carefully serialised to
    ensure we do use an engine for the barrier and not just a random
    pointer. (Along the other reuse path, we are fully serialised by the
    timeline->mutex.) The acquisition of the barrier itself is ordered through
    the strong memory barrier in llist_del_all().
    
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111397
    Fixes: d8af05ff38ae ("drm/i915: Allow sharing the idle-barrier from other kernel requests")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190813200905.11369-1-chris@chris-wilson.co.uk

Comment 4 Martin Peres 2019-08-16 08:37:36 UTC

Still visible: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6712/shard-apl1/igt@gem_exec_whisper@normal.html

<6> [686.556750] [IGT] gem_exec_whisper: starting subtest normal
<4> [816.468720] ------------[ cut here ]------------
<4> [816.468795] list_add corruption. next->prev should be prev (ffff88826140c430), but was ffff88825dec1110. (next=ffff888273e21980).
<4> [816.468824] WARNING: CPU: 2 PID: 1255 at lib/list_debug.c:25 __list_add_valid+0x36/0x70
<4> [816.468828] Modules linked in: vgem mei_hdcp x86_pkg_temp_thermal coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel i915 lpc_ich r8169 realtek snd_hda_intel snd_intel_nhlt snd_hda_codec snd_hwdep snd_hda_core snd_pcm mei_me pinctrl_broxton pinctrl_intel mei prime_numbers
<4> [816.468858] CPU: 2 PID: 1255 Comm: kworker/u8:23 Tainted: G     U            5.3.0-rc4-CI-CI_DRM_6712+ #1
<4> [816.468862] Hardware name:  /NUC6CAYB, BIOS AYAPLCEL.86A.0049.2018.0508.1356 05/08/2018
<4> [816.468964] Workqueue: i915 retire_work_handler [i915]
<4> [816.468971] RIP: 0010:__list_add_valid+0x36/0x70
<4> [816.468976] Code: 48 8b 10 4c 39 c2 75 27 48 39 f8 74 39 48 39 fa 74 34 b8 01 00 00 00 c3 48 89 d1 48 c7 c7 b8 26 0e 82 48 89 c2 e8 aa 58 bc ff <0f> 0b 31 c0 c3 48 89 c1 4c 89 c6 48 c7 c7 08 27 0e 82 e8 93 58 bc
<4> [816.468980] RSP: 0018:ffffc9000089fd40 EFLAGS: 00010286
<4> [816.468986] RAX: 0000000000000000 RBX: ffff888262f59240 RCX: 0000000000000001
<4> [816.468990] RDX: 0000000080000001 RSI: ffff88826afd0940 RDI: 00000000ffffffff
<4> [816.468993] RBP: ffffc9000089fd80 R08: ffff88826afd0940 R09: 0000000000000000
<4> [816.468998] R10: 0000000000000000 R11: 0000000000000000 R12: ffff88825dec1cb8
<4> [816.469001] R13: 0000000000000000 R14: ffff88825dec1ce0 R15: ffff888273e21980
<4> [816.469006] FS:  0000000000000000(0000) GS:ffff888277b00000(0000) knlGS:0000000000000000
<4> [816.469010] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [816.469014] CR2: 00007f3b888c9c38 CR3: 000000026a452000 CR4: 00000000003406e0
<4> [816.469018] Call Trace:
<4> [816.469100]  __i915_request_commit+0x245/0x510 [i915]
<4> [816.469179]  __engine_park+0x70/0x270 [i915]
<4> [816.469251]  ____intel_wakeref_put_last+0x1c/0x70 [i915]
<4> [816.469335]  i915_request_retire+0x418/0x840 [i915]
<4> [816.469420]  ring_retire_requests+0x47/0x50 [i915]
<4> [816.469503]  i915_retire_requests+0x57/0xc0 [i915]
<4> [816.469583]  retire_work_handler+0x27/0x60 [i915]
<4> [816.469593]  process_one_work+0x245/0x610
<4> [816.469608]  worker_thread+0x37/0x380
<4> [816.469618]  ? process_one_work+0x610/0x610
<4> [816.469623]  kthread+0x119/0x130
<4> [816.469629]  ? kthread_park+0xa0/0xa0
<4> [816.469640]  ret_from_fork+0x3a/0x50
<4> [816.469661] irq event stamp: 852414
<4> [816.469666] hardirqs last  enabled at (852413): [<ffffffff8112b537>] console_unlock+0x3f7/0x5a0
<4> [816.469673] hardirqs last disabled at (852414): [<ffffffff810019ea>] trace_hardirqs_off_thunk+0x1a/0x20
<4> [816.469679] softirqs last  enabled at (852148): [<ffffffff8129a057>] wb_workfn+0x1a7/0x5f0
<4> [816.469683] softirqs last disabled at (852146): [<ffffffff81299f65>] wb_workfn+0xb5/0x5f0
<4> [816.469689] WARNING: CPU: 2 PID: 1255 at lib/list_debug.c:25 __list_add_valid+0x36/0x70
<4> [816.469693] ---[ end trace c7c4c36d1f93c572 ]---

Comment 5 CI Bug Log 2019-08-16 08:38:44 UTC

A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@i915_selftest@live_hangcheck - timeout -  list_add corruption. next-&gt;prev should be prev (ffff888405f2c430), but was ffff88848d9492c0. (next=ffff88848d94b480). -}
{+ all machines: igt@i915_selftest@live_hangcheck / igt@gem_exec_whisper@normal - timeout -  list_add corruption. next-&gt;prev should be prev +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6712/shard-apl1/igt@gem_exec_whisper@normal.html

Comment 6 CI Bug Log 2019-08-16 08:47:39 UTC

The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* APL: igt@runner@aborted - fail - Previous test: gem_exec_whisper (normal)
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6712/shard-apl1/igt@runner@aborted.html

Comment 7 Chris Wilson 2019-08-16 09:18:22 UTC

(In reply to Martin Peres from comment #4)
> Still visible:
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6712/shard-apl1/
> igt@gem_exec_whisper@normal.html
> 
> <6> [686.556750] [IGT] gem_exec_whisper: starting subtest normal
> <4> [816.468720] ------------[ cut here ]------------
> <4> [816.468795] list_add corruption. next->prev should be prev
> (ffff88826140c430), but was ffff88825dec1110. (next=ffff888273e21980)

That's not the same bug! That's the unserialised access to the active_rings which is elsewhere in bugzilla.

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.