On CI seen now twice: https://intel-gfx-ci.01.org/CI/CI_DRM_2529/fi-bsw-n3050/igt@pm_rpm@basic-rte.html https://intel-gfx-ci.01.org/CI/CI_DRM_2537/fi-bsw-n3050/igt@pm_rpm@basic-rte.html has now incompleted 2 times. Command line: BOOT_IMAGE=/boot/drm_intel root=/dev/sda2 intel_iommu=igfx_off drm.debug=0xe nmi_watchdog=panic,auto panic=1 ro 3 No pstore: https://intel-gfx-ci.01.org/CI/Patchwork_4471/fi-bsw-n3050/ But has failed also on patchwork runs 23172, 16311, 22845 eg: https://patchwork.freedesktop.org/series/22845/ Also test igt@pm_rpm@basic-pci-d3-state has been failing on pw: https://patchwork.freedesktop.org/series/23020/ Documenting to get development attention and getting CI marked known issue.
We know it oops, but that was lost in the panic reboot. If it is our oops, we could leave the machine alive, log in after the 12/15 minute timeout, gather the logs, then reboot. That should have a better chance of getting the oops. Or netconsole... ;)
There is a script console available for each machine in the farms: http://benchsrv.fi.intel.com:8080/jenkins/computer/fi-bsw-n3050/script I ran: println "panic_on_oops: " + "cat /proc/sys/kernel/panic_on_oops".execute().text with result: panic_on_oops: 0 So, oops should not cause panic.
I was expecting pstore logs if a panic was generated. I suggest that we investigate why pstore isn't working properly. We should also deploy the owatch softdog solution we have discusssed in Jira.
Close to 100% reproducing rate on BYT-n2820 and BSW-n3050 with DRM-Tip: igt@pm_rpm@gem-execbuf-stress igt@pm_rpm@gem-execbuf-stress-extra-wait Local visuals: - screen goes blank - host doesn't react to anything - if panic specified on cmdline, host reboots No log on console, no pstore saved. NUCs don't have serial headers to get any information out.
Statistics: Failure rate 3/10 run(s) (30%) Also seen many time on bot / trybot
I see [ 1354.589788] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1 [ 1354.589792] Modules linked in: i2c_dev i915 intel_gtt drm_kms_helper prime_numbers [ 1354.589807] CPU: 1 PID: 421 Comm: kworker/1:0 Tainted: G W 4.11.0-rc7+ #471 [ 1354.589812] Hardware name: / , BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015 [ 1354.589814] Workqueue: pm pm_runtime_work [ 1354.589819] task: ffff880274d12d00 task.stack: ffffc900000c0000 [ 1354.589823] RIP: 0010:__intel_wait_for_register_fw+0x77/0x1a0 [i915] [ 1354.589825] RSP: 0018:ffffc900000c3c48 EFLAGS: 00000082 [ 1354.589831] RAX: ffffc90003530094 RBX: 0000000000130094 RCX: 0000000000000001 [ 1354.589835] RDX: 00000000000000a1 RSI: ffff88027fd15e58 RDI: 0000000000000000 [ 1354.589838] RBP: ffffc900000c3c88 R08: 0000000000000002 R09: 0000000000000000 [ 1354.589842] R10: ffffc900000c3c28 R11: ffff880276429dd0 R12: ffff880271808000 [ 1354.589846] R13: 00000000000000a0 R14: 0000000000000000 R15: 000000010003a4bc [ 1354.589849] FS: 0000000000000000(0000) GS:ffff88027fd00000(0000) knlGS:0000000000000000 [ 1354.589852] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1354.589856] CR2: 000055aef5babd60 CR3: 0000000002009000 CR4: 00000000001006e0 [ 1354.589858] Call Trace: [ 1354.589861] intel_wait_for_register+0x70/0x140 [i915] [ 1354.589864] vlv_suspend_complete+0x6f/0x610 [i915] [ 1354.589866] intel_runtime_suspend+0x16c/0x2a0 [i915] [ 1354.589869] pci_pm_runtime_suspend+0x50/0x180 [ 1354.589872] ? pci_pm_runtime_resume+0xa0/0xa0 [ 1354.589874] __rpm_callback+0xc5/0x210 [ 1354.589877] ? __queue_delayed_work+0x83/0x1a0 [ 1354.589879] rpm_callback+0x1f/0x80 [ 1354.589882] ? pci_pm_runtime_resume+0xa0/0xa0 [ 1354.589884] rpm_suspend+0x118/0x580 [ 1354.589887] ? __i915_gem_free_objects+0x2d1/0x3b0 [i915] [ 1354.589889] pm_runtime_work+0x64/0x90 [ 1354.589892] process_one_work+0x1bb/0x3e0 [ 1354.589894] worker_thread+0x46/0x4f0 [ 1354.589896] ? __schedule+0x18b/0x610 [ 1354.589899] kthread+0xff/0x140 [ 1354.589901] ? process_one_work+0x3e0/0x3e0 [ 1354.589904] ? kthread_create_on_node+0x40/0x40 [ 1354.589906] ret_from_fork+0x29/0x40 [ 1354.589908] Code: e8 ff 1d 02 e1 48 8b 15 18 25 96 e1 4c 8d 7c 02 01 48 8b 05 0c 25 96 e1 4c 89 f9 48 29 c1 48 89 d8 49 03 84 24 80 06 00 00 8b 10 <44> 89 e8 21 d0 41 39 c6 74 33 48 85 c9 0f 88 04 01 00 00 65 8b on both byt/bsw. That seems significant and probably the same.
vlv_suspend_complete+0x6f is vlv_wait_for_gt_wells()
I have the impression it doesn't approve of us reading VLV_GTLC_PW_STATUS frequently: diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c index 5f2aeb27aeb7..acb99603b955 100644 --- a/drivers/gpu/drm/i915/i915_drv.c +++ b/drivers/gpu/drm/i915/i915_drv.c @@ -2246,9 +2246,8 @@ static int vlv_wait_for_gt_wells(struct drm_i915_private *dev_priv, * RC6 transitioning can be delayed up to 2 msec (see * valleyview_enable_rps), use 3 msec for safety. */ - err = intel_wait_for_register(dev_priv, - VLV_GTLC_PW_STATUS, mask, val, - 3); + err = wait_for((I915_READ_NOTRACE(VLV_GTLC_PW_STATUS) & mask) == val, + 3); if (err) DRM_ERROR("timeout waiting for GT wells to go %s\n", onoff(wait_for_on));
Confirmed. With this patch BSW-n3050 can pass several rounds of # /opt/igt/scripts/run-tests.sh -vt igt@pm_rpm@gem-execbuf-stress Without the patch, fail rate is close to 100% for these 2/3 tests. Only tested on single machine. Please run through Trybot. Tested-by: Tomi Sarvela <tomi.p.sarvela@intel.com>
commit 3dd14c04d77d7d702de5aa7157df4cc9417329f3 Author: Chris Wilson <chris@chris-wilson.co.uk> Date: Fri Apr 21 14:58:15 2017 +0100 drm/i915: Avoid busy-spinning on VLV_GLTC_PW_STATUS mmio The busy-spin, as the first stage of intel_wait_for_register(), is currently under suspicion for causing: [ 62.034926] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1 [ 62.034928] Modules linked in: i2c_dev i915 intel_gtt drm_kms_helper prime_numbers [ 62.034932] CPU: 1 PID: 183 Comm: kworker/1:2 Not tainted 4.11.0-rc7+ #471 [ 62.034933] Hardware name: / , BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015 [ 62.034934] Workqueue: pm pm_runtime_work [ 62.034936] task: ffff880275a04ec0 task.stack: ffffc900002d8000 [ 62.034936] RIP: 0010:__intel_wait_for_register_fw+0x77/0x1a0 [i915] [ 62.034937] RSP: 0018:ffffc900002dbc38 EFLAGS: 00000082 [ 62.034939] RAX: ffffc90003530094 RBX: 0000000000130094 RCX: 0000000000000001 [ 62.034940] RDX: 00000000000000a1 RSI: ffff88027fd15e58 RDI: 0000000000000000 [ 62.034941] RBP: ffffc900002dbc78 R08: 0000000000000002 R09: 0000000000000000 [ 62.034942] R10: ffffc900002dbc18 R11: ffff880276429dd0 R12: ffff8802707c0000 [ 62.034943] R13: 00000000000000a0 R14: 0000000000000000 R15: 00000000fffefc10 [ 62.034945] FS: 0000000000000000(0000) GS:ffff88027fd00000(0000) knlGS:0000000000000000 [ 62.034945] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 62.034947] CR2: 00007ffd3cd98ff8 CR3: 0000000274c19000 CR4: 00000000001006e0 [ 62.034947] Call Trace: [ 62.034948] intel_wait_for_register+0x77/0x140 [i915] [ 62.034949] vlv_suspend_complete+0x23/0x5b0 [i915] [ 62.034950] intel_runtime_suspend+0x16c/0x2a0 [i915] [ 62.034950] pci_pm_runtime_suspend+0x50/0x180 [ 62.034951] ? pci_pm_runtime_resume+0xa0/0xa0 [ 62.034952] __rpm_callback+0xc5/0x210 [ 62.034953] rpm_callback+0x1f/0x80 [ 62.034953] ? pci_pm_runtime_resume+0xa0/0xa0 [ 62.034954] rpm_suspend+0x118/0x580 [ 62.034955] pm_runtime_work+0x64/0x90 [ 62.034956] process_one_work+0x1bb/0x3e0 [ 62.034956] worker_thread+0x46/0x4f0 [ 62.034957] ? __schedule+0x18b/0x610 [ 62.034958] kthread+0xff/0x140 [ 62.034958] ? process_one_work+0x3e0/0x3e0 [ 62.034959] ? kthread_create_on_node+ and related hard lockups in CI for byt and bsw. Note this effectively reverts commits 41ce405e6894 and b27366958869 ("drm/i915: Convert wait_for(I915_READ(reg)) to intel_wait_for_register()") v2: Convert bool allow into a u32 mask for clarity and repeat the comment on vlv rc6 timing to justify the 3ms timeout used for the wait (Ville) Fixes: 41ce405e6894 ("drm/i915: Convert wait_for(I915_READ(reg)) to intel_wait_for_register()") Fixes: b27366958869 ("drm/i915: Convert wait_for(I915_READ(reg)) to intel_wait_for_register()") Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100718 Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk> Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com> Cc: Ville Syrjälä <ville.syrjala@linux.intel.com> Cc: Tomi Sarvela <tomi.p.sarvela@intel.com> Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com> Link: http://patchwork.freedesktop.org/patch/msgid/20170421135815.11897-1-chris@chris-wilson.co.uk Tested-by: Tomi Sarvela <tomi.p.sarvela@intel.com> * fingers crossed
*** Bug 99740 has been marked as a duplicate of this bug. ***
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.