Bug 100718

Summary:	[BSW][BAT] igt@pm_rpm@basic-rte incomplete
Product:	DRI	Reporter:	Jani Saarinen <jani.saarinen>
Component:	DRM/Intel	Assignee:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Status:	CLOSED FIXED	QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity:	critical
Priority:	highest	CC:	intel-gfx-bugs
Version:	DRI git
Hardware:	Other
OS:	All
Whiteboard:
i915 platform:	BSW/CHT, BYT	i915 features:	power/runtime PM

Description Jani Saarinen 2017-04-19 06:23:07 UTC

On CI seen now twice: 
https://intel-gfx-ci.01.org/CI/CI_DRM_2529/fi-bsw-n3050/igt@pm_rpm@basic-rte.html
https://intel-gfx-ci.01.org/CI/CI_DRM_2537/fi-bsw-n3050/igt@pm_rpm@basic-rte.html

has now incompleted 2 times.

Command line: BOOT_IMAGE=/boot/drm_intel root=/dev/sda2 intel_iommu=igfx_off drm.debug=0xe nmi_watchdog=panic,auto panic=1 ro 3

No pstore: https://intel-gfx-ci.01.org/CI/Patchwork_4471/fi-bsw-n3050/

But has failed also on patchwork runs 23172, 16311, 22845
eg: https://patchwork.freedesktop.org/series/22845/

Also test igt@pm_rpm@basic-pci-d3-state has been failing on pw:
https://patchwork.freedesktop.org/series/23020/

Documenting to get development attention and getting CI marked known issue.

Comment 1 Chris Wilson 2017-04-19 07:17:04 UTC

We know it oops, but that was lost in the panic reboot. If it is our oops, we could leave the machine alive, log in after the 12/15 minute timeout, gather the logs, then reboot. That should have a better chance of getting the oops. Or netconsole... ;)

Comment 2 Marta Löfstedt 2017-04-19 07:53:31 UTC

There is a script console available for each machine in the farms:

http://benchsrv.fi.intel.com:8080/jenkins/computer/fi-bsw-n3050/script

I ran:
println "panic_on_oops: " + "cat /proc/sys/kernel/panic_on_oops".execute().text

with result:
panic_on_oops: 0

So, oops should not cause panic.

Comment 3 Marta Löfstedt 2017-04-19 08:06:44 UTC

I was expecting pstore logs if a panic was generated. 
I suggest that we investigate why pstore isn't working properly. 

We should also deploy the owatch softdog solution we have discusssed in Jira.

Comment 4 Tomi Sarvela 2017-04-19 10:16:54 UTC

Close to 100% reproducing rate on BYT-n2820 and BSW-n3050 with DRM-Tip:
igt@pm_rpm@gem-execbuf-stress
igt@pm_rpm@gem-execbuf-stress-extra-wait

Local visuals:
- screen goes blank
- host doesn't react to anything
- if panic specified on cmdline, host reboots

No log on console, no pstore saved. NUCs don't have serial headers to get any information out.

Comment 5 Jani Saarinen 2017-04-21 07:15:49 UTC

Statistics: Failure rate 3/10 run(s) (30%)
Also seen many time on bot / trybot

Comment 6 Chris Wilson 2017-04-21 10:11:32 UTC

I see

[ 1354.589788] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
[ 1354.589792] Modules linked in: i2c_dev i915 intel_gtt drm_kms_helper prime_numbers
[ 1354.589807] CPU: 1 PID: 421 Comm: kworker/1:0 Tainted: G        W       4.11.0-rc7+ #471
[ 1354.589812] Hardware name:                  /        , BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015
[ 1354.589814] Workqueue: pm pm_runtime_work
[ 1354.589819] task: ffff880274d12d00 task.stack: ffffc900000c0000
[ 1354.589823] RIP: 0010:__intel_wait_for_register_fw+0x77/0x1a0 [i915]
[ 1354.589825] RSP: 0018:ffffc900000c3c48 EFLAGS: 00000082
[ 1354.589831] RAX: ffffc90003530094 RBX: 0000000000130094 RCX: 0000000000000001
[ 1354.589835] RDX: 00000000000000a1 RSI: ffff88027fd15e58 RDI: 0000000000000000
[ 1354.589838] RBP: ffffc900000c3c88 R08: 0000000000000002 R09: 0000000000000000
[ 1354.589842] R10: ffffc900000c3c28 R11: ffff880276429dd0 R12: ffff880271808000
[ 1354.589846] R13: 00000000000000a0 R14: 0000000000000000 R15: 000000010003a4bc
[ 1354.589849] FS:  0000000000000000(0000) GS:ffff88027fd00000(0000) knlGS:0000000000000000
[ 1354.589852] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1354.589856] CR2: 000055aef5babd60 CR3: 0000000002009000 CR4: 00000000001006e0
[ 1354.589858] Call Trace:
[ 1354.589861]  intel_wait_for_register+0x70/0x140 [i915]
[ 1354.589864]  vlv_suspend_complete+0x6f/0x610 [i915]
[ 1354.589866]  intel_runtime_suspend+0x16c/0x2a0 [i915]
[ 1354.589869]  pci_pm_runtime_suspend+0x50/0x180
[ 1354.589872]  ? pci_pm_runtime_resume+0xa0/0xa0
[ 1354.589874]  __rpm_callback+0xc5/0x210
[ 1354.589877]  ? __queue_delayed_work+0x83/0x1a0
[ 1354.589879]  rpm_callback+0x1f/0x80
[ 1354.589882]  ? pci_pm_runtime_resume+0xa0/0xa0
[ 1354.589884]  rpm_suspend+0x118/0x580
[ 1354.589887]  ? __i915_gem_free_objects+0x2d1/0x3b0 [i915]
[ 1354.589889]  pm_runtime_work+0x64/0x90
[ 1354.589892]  process_one_work+0x1bb/0x3e0
[ 1354.589894]  worker_thread+0x46/0x4f0
[ 1354.589896]  ? __schedule+0x18b/0x610
[ 1354.589899]  kthread+0xff/0x140
[ 1354.589901]  ? process_one_work+0x3e0/0x3e0
[ 1354.589904]  ? kthread_create_on_node+0x40/0x40
[ 1354.589906]  ret_from_fork+0x29/0x40
[ 1354.589908] Code: e8 ff 1d 02 e1 48 8b 15 18 25 96 e1 4c 8d 7c 02 01 48 8b 05 0c 25 96 e1 4c 89 f9 48 29 c1 48 89 d8 49 03 84 24 80 06 00 00 8b 10 <44> 89 e8 21 d0 41 39 c6 74 33 48 85 c9 0f 88 04 01 00 00 65 8b 

on both byt/bsw. That seems significant and probably the same.

Comment 7 Chris Wilson 2017-04-21 10:15:04 UTC

vlv_suspend_complete+0x6f is vlv_wait_for_gt_wells()

Comment 8 Chris Wilson 2017-04-21 11:49:36 UTC

I have the impression it doesn't approve of us reading VLV_GTLC_PW_STATUS frequently:

diff --git a/drivers/gpu/drm/i915/i915_drv.c b/drivers/gpu/drm/i915/i915_drv.c
index 5f2aeb27aeb7..acb99603b955 100644
--- a/drivers/gpu/drm/i915/i915_drv.c
+++ b/drivers/gpu/drm/i915/i915_drv.c
@@ -2246,9 +2246,8 @@ static int vlv_wait_for_gt_wells(struct drm_i915_private *dev_priv,
         * RC6 transitioning can be delayed up to 2 msec (see
         * valleyview_enable_rps), use 3 msec for safety.
         */
-       err = intel_wait_for_register(dev_priv,
-                                     VLV_GTLC_PW_STATUS, mask, val,
-                                     3);
+       err = wait_for((I915_READ_NOTRACE(VLV_GTLC_PW_STATUS) & mask) == val,
+                      3);
        if (err)
                DRM_ERROR("timeout waiting for GT wells to go %s\n",
                          onoff(wait_for_on));

Comment 9 Tomi Sarvela 2017-04-21 12:59:34 UTC

Confirmed. With this patch BSW-n3050 can pass several rounds of
# /opt/igt/scripts/run-tests.sh -vt igt@pm_rpm@gem-execbuf-stress

Without the patch, fail rate is close to 100% for these 2/3 tests.

Only tested on single machine. Please run through Trybot.

Tested-by: Tomi Sarvela <tomi.p.sarvela@intel.com>

Comment 10 Chris Wilson 2017-04-21 15:44:27 UTC

commit 3dd14c04d77d7d702de5aa7157df4cc9417329f3
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Apr 21 14:58:15 2017 +0100

    drm/i915: Avoid busy-spinning on VLV_GLTC_PW_STATUS mmio
    
    The busy-spin, as the first stage of intel_wait_for_register(), is
    currently under suspicion for causing:
    
    [   62.034926] NMI watchdog: Watchdog detected hard LOCKUP on cpu 1
    [   62.034928] Modules linked in: i2c_dev i915 intel_gtt drm_kms_helper prime_numbers
    [   62.034932] CPU: 1 PID: 183 Comm: kworker/1:2 Not tainted 4.11.0-rc7+ #471
    [   62.034933] Hardware name:                  /        , BIOS PYBSWCEL.86A.0027.2015.0507.1758 05/07/2015
    [   62.034934] Workqueue: pm pm_runtime_work
    [   62.034936] task: ffff880275a04ec0 task.stack: ffffc900002d8000
    [   62.034936] RIP: 0010:__intel_wait_for_register_fw+0x77/0x1a0 [i915]
    [   62.034937] RSP: 0018:ffffc900002dbc38 EFLAGS: 00000082
    [   62.034939] RAX: ffffc90003530094 RBX: 0000000000130094 RCX: 0000000000000001
    [   62.034940] RDX: 00000000000000a1 RSI: ffff88027fd15e58 RDI: 0000000000000000
    [   62.034941] RBP: ffffc900002dbc78 R08: 0000000000000002 R09: 0000000000000000
    [   62.034942] R10: ffffc900002dbc18 R11: ffff880276429dd0 R12: ffff8802707c0000
    [   62.034943] R13: 00000000000000a0 R14: 0000000000000000 R15: 00000000fffefc10
    [   62.034945] FS:  0000000000000000(0000) GS:ffff88027fd00000(0000) knlGS:0000000000000000
    [   62.034945] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [   62.034947] CR2: 00007ffd3cd98ff8 CR3: 0000000274c19000 CR4: 00000000001006e0
    [   62.034947] Call Trace:
    [   62.034948]  intel_wait_for_register+0x77/0x140 [i915]
    [   62.034949]  vlv_suspend_complete+0x23/0x5b0 [i915]
    [   62.034950]  intel_runtime_suspend+0x16c/0x2a0 [i915]
    [   62.034950]  pci_pm_runtime_suspend+0x50/0x180
    [   62.034951]  ? pci_pm_runtime_resume+0xa0/0xa0
    [   62.034952]  __rpm_callback+0xc5/0x210
    [   62.034953]  rpm_callback+0x1f/0x80
    [   62.034953]  ? pci_pm_runtime_resume+0xa0/0xa0
    [   62.034954]  rpm_suspend+0x118/0x580
    [   62.034955]  pm_runtime_work+0x64/0x90
    [   62.034956]  process_one_work+0x1bb/0x3e0
    [   62.034956]  worker_thread+0x46/0x4f0
    [   62.034957]  ? __schedule+0x18b/0x610
    [   62.034958]  kthread+0xff/0x140
    [   62.034958]  ? process_one_work+0x3e0/0x3e0
    [   62.034959]  ? kthread_create_on_node+
    
    and related hard lockups in CI for byt and bsw.
    
    Note this effectively reverts commits 41ce405e6894 and b27366958869
    ("drm/i915: Convert wait_for(I915_READ(reg)) to intel_wait_for_register()")
    
    v2: Convert bool allow into a u32 mask for clarity and repeat the
    comment on vlv rc6 timing to justify the 3ms timeout used for the wait (Ville)
    
    Fixes: 41ce405e6894 ("drm/i915: Convert wait_for(I915_READ(reg)) to intel_wait_for_register()")
    Fixes: b27366958869 ("drm/i915: Convert wait_for(I915_READ(reg)) to intel_wait_for_register()")
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100718
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Cc: Tomi Sarvela <tomi.p.sarvela@intel.com>
    Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Link: http://patchwork.freedesktop.org/patch/msgid/20170421135815.11897-1-chris@chris-wilson.co.uk
    Tested-by: Tomi Sarvela <tomi.p.sarvela@intel.com>

* fingers crossed

Comment 11 Jani Saarinen 2017-04-21 15:52:52 UTC

*** Bug 99740 has been marked as a duplicate of this bug. ***

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.