Summary: | [BSW][BAT] igt@pm_rpm@basic-rte incomplete | | |
---|---|---|---|
Product: | DRI | Reporter: | Jani Saarinen <jani.saarinen> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | critical | | |
Priority: | highest | CC: | intel-gfx-bugs |
Version: | DRI git | | |
Hardware: | Other | | |
OS: | All | | |
Whiteboard: | | | |
i915 platform: | BSW/CHT, BYT | i915 features: | power/runtime PM |
Description
Jani Saarinen
2017-04-19 06:23:07 UTC
We know it oops, but that was lost in the panic reboot. If it is our oops, we could leave the machine alive, log in after the 12/15 minute timeout, gather the logs, then reboot. That should have a better chance of getting the oops. Or netconsole... ;)

There is a script console available for each machine in the farms:
http://benchsrv.fi.intel.com:8080/jenkins/computer/fi-bsw-n3050/script

I ran:
    println "panic_on_oops: " + "cat /proc/sys/kernel/panic_on_oops".execute().text
with result:
    panic_on_oops: 0

So, an oops should not cause a panic. I was expecting pstore logs if a panic was generated. I suggest that we investigate why pstore isn't working properly. We should also deploy the owatch softdog solution we have discussed in Jira.

Close to 100% reproduction rate on BYT-n2820 and BSW-n3050 with DRM-Tip:
igt@pm_rpm@gem-execbuf-stress
igt@pm_rpm@gem-execbuf-stress-extra-wait

Local visuals:
- screen goes blank
- host doesn't react to anything
- if panic specified on cmdline, host reboots

No log on console, no pstore saved. NUCs don't have serial headers to get any information out.
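For anyone who wants to repeat the panic_on_oops check above without the Jenkins script console, here is a minimal C sketch that reads a single integer from a procfs-style file. The helper name `read_int_file` is made up for illustration; only the path /proc/sys/kernel/panic_on_oops comes from this report.

```c
#include <stdio.h>

/*
 * Hypothetical helper: read one integer from a procfs-style file such as
 * /proc/sys/kernel/panic_on_oops.
 * Returns 0 on success and stores the value in *out, -1 on any failure.
 */
static int read_int_file(const char *path, long *out)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f)
        return -1;

    /* The sysctl file holds a single decimal number followed by a newline. */
    ok = (fscanf(f, "%ld", out) == 1);
    fclose(f);

    return ok ? 0 : -1;
}
```

Calling `read_int_file("/proc/sys/kernel/panic_on_oops", &value)` on the test host should give the same 0 as the script console did, assuming the setting has not been changed in the meantime.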
Statistics: Failure rate 3/10 run(s) (30%)
Also seen many times on bot / trybot

I see

[ 1354.589788] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
[ 1354.589792] Modules linked in: i2c_dev i915 intel_gtt drm_kms_helper prime_numbers
[ 1354.589807] CPU: 1 PID: 421 Comm: kworker/1:0 Tainted: G W 4.11.0-rc7+ #471
[ 1354.589812] Hardware name: / , BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015
[ 1354.589814] Workqueue: pm pm_runtime_work
[ 1354.589819] task: ffff880274d12d00 task.stack: ffffc900000c0000
[ 1354.589823] RIP: 0010:__intel_wait_for_register_fw+0x77/0x1a0 [i915]
[ 1354.589825] RSP: 0018:ffffc900000c3c48 EFLAGS: 00000082
[ 1354.589831] RAX: ffffc90003530094 RBX: 0000000000130094 RCX: 0000000000000001
[ 1354.589835] RDX: 00000000000000a1 RSI: ffff88027fd15e58 RDI: 0000000000000000
[ 1354.589838] RBP: ffffc900000c3c88 R08: 0000000000000002 R09: 0000000000000000
[ 1354.589842] R10: ffffc900000c3c28 R11: ffff880276429dd0 R12: ffff880271808000
[ 1354.589846] R13: 00000000000000a0 R14: 0000000000000000 R15: 000000010003a4bc
[ 1354.589849] FS: 0000000000000000(0000) GS:ffff88027fd00000(0000) knlGS:0000000000000000
[ 1354.589852] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1354.589856] CR2: 000055aef5babd60 CR3: 0000000002009000 CR4: 00000000001006e0
[ 1354.589858] Call Trace:
[ 1354.589861] intel_wait_for_register+0x70/0x140 [i915]
[ 1354.589864] vlv_suspend_complete+0x6f/0x610 [i915]
[ 1354.589866] intel_runtime_suspend+0x16c/0x2a0 [i915]
[ 1354.589869] pci_pm_runtime_suspend+0x50/0x180
[ 1354.589872] ? pci_pm_runtime_resume+0xa0/0xa0
[ 1354.589874] __rpm_callback+0xc5/0x210
[ 1354.589877] ? __queue_delayed_work+0x83/0x1a0
[ 1354.589879] rpm_callback+0x1f/0x80
[ 1354.589882] ? pci_pm_runtime_resume+0xa0/0xa0
[ 1354.589884] rpm_suspend+0x118/0x580
[ 1354.589887] ? __i915_gem_free_objects+0x2d1/0x3b0 [i915]
[ 1354.589889] pm_runtime_work+0x64/0x90
[ 1354.589892] process_one_work+0x1bb/0x3e0
[ 1354.589894] worker_thread+0x46/0x4f0
[ 1354.589896] ? __schedule+0x18b/0x610
[ 1354.589899] kthread+0xff/0x140
[ 1354.589901] ? process_one_work+0x3e0/0x3e0
[ 1354.589904] ? kthread_create_on_node+0x40/0x40
[ 1354.589906] ret_from_fork+0x29/0x40
[ 1354.589908] Code: e8 ff 1d 02 e1 48 8b 15 18 25 96 e1 4c 8d 7c 02 01 48 8b 05 0c 25 96 e1 4c 89 f9 48 29 c1 48 89 d8 49 03 84 24 80 06 00 00 8b 10 <44> 89 e8 21 d0 41 39 c6 74 33 48 85 c9 0f 88 04 01 00 00 65 8b

on both byt/bsw. That seems significant and probably the same.

vlv_suspend_complete+0x6f is vlv_wait_for_gt_wells()

I have the impression it doesn't approve of us reading VLV_GTLC_PW_STATUS frequently:

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 5f2aeb27aeb7..acb99603b955 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -2246,9 +2246,8 @@ static int vlv_wait_for_gt_wells(struct drm_i915_private *dev_priv,
 	 * RC6 transitioning can be delayed up to 2 msec (see
 	 * valleyview_enable_rps), use 3 msec for safety.
 	 */
-	err = intel_wait_for_register(dev_priv,
-				      VLV_GTLC_PW_STATUS, mask, val,
-				      3);
+	err = wait_for((I915_READ_NOTRACE(VLV_GTLC_PW_STATUS) & mask) == val,
+		       3);
 	if (err)
 		DRM_ERROR("timeout waiting for GT wells to go %s\n",
 			  onoff(wait_for_on));

Confirmed. With this patch BSW-n3050 can pass several rounds of
# /opt/igt/scripts/run-tests.sh -vt igt@pm_rpm@gem-execbuf-stress

Without the patch, fail rate is close to 100% for these 2/3 tests.

Only tested on single machine. Please run through Trybot.
Tested-by: Tomi Sarvela <tomi.p.sarvela@intel.com>

commit 3dd14c04d77d7d702de5aa7157df4cc9417329f3
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Apr 21 14:58:15 2017 +0100

    drm/i915: Avoid busy-spinning on VLV_GLTC_PW_STATUS mmio

    The busy-spin, as the first stage of intel_wait_for_register(), is currently under suspicion for causing:

    [ 62.034926] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
    [ 62.034928] Modules linked in: i2c_dev i915 intel_gtt drm_kms_helper prime_numbers
    [ 62.034932] CPU: 1 PID: 183 Comm: kworker/1:2 Not tainted 4.11.0-rc7+ #471
    [ 62.034933] Hardware name: / , BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015
    [ 62.034934] Workqueue: pm pm_runtime_work
    [ 62.034936] task: ffff880275a04ec0 task.stack: ffffc900002d8000
    [ 62.034936] RIP: 0010:__intel_wait_for_register_fw+0x77/0x1a0 [i915]
    [ 62.034937] RSP: 0018:ffffc900002dbc38 EFLAGS: 00000082
    [ 62.034939] RAX: ffffc90003530094 RBX: 0000000000130094 RCX: 0000000000000001
    [ 62.034940] RDX: 00000000000000a1 RSI: ffff88027fd15e58 RDI: 0000000000000000
    [ 62.034941] RBP: ffffc900002dbc78 R08: 0000000000000002 R09: 0000000000000000
    [ 62.034942] R10: ffffc900002dbc18 R11: ffff880276429dd0 R12: ffff8802707c0000
    [ 62.034943] R13: 00000000000000a0 R14: 0000000000000000 R15: 00000000fffefc10
    [ 62.034945] FS: 0000000000000000(0000) GS:ffff88027fd00000(0000) knlGS:0000000000000000
    [ 62.034945] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 62.034947] CR2: 00007ffd3cd98ff8 CR3: 0000000274c19000 CR4: 00000000001006e0
    [ 62.034947] Call Trace:
    [ 62.034948] intel_wait_for_register+0x77/0x140 [i915]
    [ 62.034949] vlv_suspend_complete+0x23/0x5b0 [i915]
    [ 62.034950] intel_runtime_suspend+0x16c/0x2a0 [i915]
    [ 62.034950] pci_pm_runtime_suspend+0x50/0x180
    [ 62.034951] ? pci_pm_runtime_resume+0xa0/0xa0
    [ 62.034952] __rpm_callback+0xc5/0x210
    [ 62.034953] rpm_callback+0x1f/0x80
    [ 62.034953] ? pci_pm_runtime_resume+0xa0/0xa0
    [ 62.034954] rpm_suspend+0x118/0x580
    [ 62.034955] pm_runtime_work+0x64/0x90
    [ 62.034956] process_one_work+0x1bb/0x3e0
    [ 62.034956] worker_thread+0x46/0x4f0
    [ 62.034957] ? __schedule+0x18b/0x610
    [ 62.034958] kthread+0xff/0x140
    [ 62.034958] ? process_one_work+0x3e0/0x3e0
    [ 62.034959] ? kthread_create_on_node+

    and related hard lockups in CI for byt and bsw.

    Note this effectively reverts commits 41ce405e6894 and b27366958869 ("drm/i915: Convert wait_for(I915_READ(reg)) to intel_wait_for_register()")

    v2: Convert bool allow into a u32 mask for clarity and repeat the comment on vlv rc6 timing to justify the 3ms timeout used for the wait (Ville)

    Fixes: 41ce405e6894 ("drm/i915: Convert wait_for(I915_READ(reg)) to intel_wait_for_register()")
    Fixes: b27366958869 ("drm/i915: Convert wait_for(I915_READ(reg)) to intel_wait_for_register()")
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100718
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Cc: Tomi Sarvela <tomi.p.sarvela@intel.com>
    Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/20170421135815.11897-1-chris@chris-wilson.co.uk
    Tested-by: Tomi Sarvela <tomi.p.sarvela@intel.com>

* fingers crossed

*** Bug 99740 has been marked as a duplicate of this bug. ***
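To illustrate the idea behind the fix quoted above (polling the status register with sleeps in between instead of spinning on it), here is a rough user-space sketch in C. It is not the i915 wait_for() macro and does not touch real MMIO: `poll_reg_sleeping`, `read_reg_fn`, `struct fake_reg` and the backoff values are all hypothetical names and numbers chosen for the example, and the "register" is a plain callback.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical register accessor: stands in for an MMIO read. */
typedef uint32_t (*read_reg_fn)(void *ctx);

/* Monotonic clock in microseconds. */
static int64_t now_us(void)
{
    struct timespec ts;

    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (int64_t)ts.tv_sec * 1000000 + ts.tv_nsec / 1000;
}

static void sleep_us(long us)
{
    struct timespec ts = {
        .tv_sec = us / 1000000,
        .tv_nsec = (us % 1000000) * 1000,
    };

    nanosleep(&ts, NULL);
}

/*
 * Poll until (read_reg() & mask) == value or timeout_ms has elapsed,
 * sleeping between reads rather than busy-spinning on the register.
 * Returns 0 on success, -ETIMEDOUT on timeout.
 */
static int poll_reg_sleeping(read_reg_fn read_reg, void *ctx,
                             uint32_t mask, uint32_t value,
                             unsigned int timeout_ms)
{
    const int64_t deadline = now_us() + (int64_t)timeout_ms * 1000;
    long backoff_us = 10; /* assumed starting interval, not from the driver */

    for (;;) {
        /* Sample the clock before the read so one last read always happens. */
        const int expired = now_us() >= deadline;

        if ((read_reg(ctx) & mask) == value)
            return 0;
        if (expired)
            return -ETIMEDOUT;

        sleep_us(backoff_us);
        if (backoff_us < 1000)
            backoff_us *= 2;
    }
}

/* Stand-in register for demonstration: returns `busy` for the first
 * `reads_left` reads, then `ready`. */
struct fake_reg {
    unsigned int reads_left;
    uint32_t busy;
    uint32_t ready;
};

static uint32_t fake_read(void *ctx)
{
    struct fake_reg *r = ctx;

    if (r->reads_left == 0)
        return r->ready;
    r->reads_left--;
    return r->busy;
}
```

With a `struct fake_reg` that becomes ready after a few reads, `poll_reg_sleeping(fake_read, &reg, mask, value, 3)` returns 0; if the value never appears it returns -ETIMEDOUT once the 3 ms budget is used up. The point of the sketch is only that the register is read a handful of times per millisecond at most, rather than back-to-back.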