Bug 102332

Summary: [BAT][CI][ALL] igt@potentially all tests - watchdog: watchdog0: watchdog did not stop
Product: DRI
Reporter: Martin Peres <martin.peres>
Component: DRM/Intel
Assignee: Petri Latvala <petri.latvala>
Status: CLOSED WORKSFORME
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical
Priority: high
CC: intel-gfx-bugs, marta.lofstedt, ricardo.vega
Version: XOrg git
Hardware: Other
OS: All
Whiteboard: ReadyForDev
i915 platform: ALL
i915 features:
Attachments:
  Description: list of suspicious runs; Flags: none

Description Martin Peres 2017-08-21 08:14:59 UTC
On CI_DRM_2976 the sharded machine hsw3 hard-hung while executing igt@gem_wait@write-wait-bsd2.

This may be due to the following warning in the boot log:
[    3.744474] [drm:sandybridge_pcode_read [i915]] warning: pcode (read from mbox 5) mailbox access failed for intel_enable_gt_powersave [i915]: -6

Full logs: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_2976/shard-hsw3/igt@gem_wait@write-wait-bsd2.html
Comment 1 Chris Wilson 2017-08-21 09:36:52 UTC
Nah, that isn't as bad as:

[   25.103020] watchdog: watchdog0: watchdog did not stop!
Comment 2 Marta Löfstedt 2017-09-20 08:20:37 UTC
*** Bug 102893 has been marked as a duplicate of this bug. ***
Comment 3 Marta Löfstedt 2017-09-20 08:26:19 UTC
I believe this can happen to any test at any time. Very scary: it could potentially kill all coverage on HSW.
Comment 4 Petri Latvala 2017-09-20 11:13:55 UTC
Something opened the watchdog device and didn't close it properly, leaving the timebomb ticking.

We wrap piglit in owatch, and theoretically this could be owatch dying, but that should also kill piglit, and there's IGT output in dmesg after the watchdog message.
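
For reference, a minimal sketch of what "closing it properly" means under the Linux watchdog API, assuming the driver supports magic close and nowayout is not set. This is not owatch's code; `hold_and_release` and its `path`/`clean` parameters are hypothetical names used only for illustration.

```python
import os

MAGIC_CLOSE = b"V"  # magic-close character defined by the Linux watchdog API


def hold_and_release(path="/dev/watchdog0", clean=True):
    """Open a watchdog device, ping it once, then close it.

    With clean=True the magic character is the last thing written before
    close(), which asks the driver to disarm the timer. With clean=False
    the device is closed while still armed, which is the case that is
    expected to log "watchdog did not stop!".
    """
    fd = os.open(path, os.O_WRONLY)
    try:
        os.write(fd, b"\0")  # any write counts as a keepalive ping
        if clean:
            os.write(fd, MAGIC_CLOSE)
    finally:
        os.close(fd)
```

Do not call this against a real watchdog device on a machine you care about: with `clean=False` the timer keeps running and the machine will reset or panic when it expires. A process that is killed before it reaches the magic write ends up in the same state.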
Comment 5 Marta Löfstedt 2017-09-20 11:25:28 UTC
FYI: the "[    3.744474] [drm:sandybridge_pcode_read [i915]] warning: pcode (read from mbox 5) mailbox access failed for intel_enable_gt_powersave [i915]: -6" warning has nothing to do with this; this log line appears to be a feature of every boot of the HSW shards.
Comment 6 Marta Löfstedt 2017-09-21 10:53:41 UTC
Created attachment 134405 [details]
list of suspicious runs

List of dmesgs from 21 runs where this issue has happened.
Comment 7 Marta Löfstedt 2017-09-22 07:59:40 UTC
*** Bug 102656 has been marked as a duplicate of this bug. ***
Comment 8 Marta Löfstedt 2017-09-22 08:12:27 UTC
*** Bug 102331 has been marked as a duplicate of this bug. ***
Comment 9 Marta Löfstedt 2017-09-22 08:24:54 UTC
I went through the list and found 2 more bugs that duplicate this issue.
Comment 10 Elizabeth 2017-09-28 15:48:40 UTC
Changing to medium/major since this is a sporadic BAT failure.
Comment 25 Marta Löfstedt 2017-10-18 07:08:36 UTC
Note:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3252/shard-glkb5/dmesg-1508248150_Panic_1.log

<14>[  111.687451] [IGT] kms_cursor_legacy: exiting, ret=0
<2>[  111.764992] watchdog: watchdog0: watchdog did not stop!
<2>[  471.585302] softdog: Initiating panic

I have never seen this as late as 111s; otherwise it always appears at ~30s after boot. This is now even scarier, since we potentially need to grep all dmesg logs to find every occurrence of this issue.
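
A rough sketch of such a scan, assuming the dmesg files use the `<level>[ timestamp] message` layout shown above. The 60s cut-off for "late" is an assumption based on the ~30s figure, and `scan_dmesg` is a hypothetical helper, not an existing CI tool.

```python
import re

# Matches dmesg lines such as "<2>[  111.764992] watchdog: ..." (level prefix optional).
LINE_RE = re.compile(r"^(?:<\d+>)?\[\s*(\d+\.\d+)\]\s*(.*)$")
NOT_STOPPED = "watchdog did not stop!"
PANIC = "softdog: Initiating panic"
LATE_AFTER_S = 60.0  # assumed cut-off; the earlier reports were all at ~30s after boot


def scan_dmesg(lines):
    """Return (events, panic_time) for one dmesg log.

    events is a list of (timestamp, is_late) tuples, one per
    "watchdog did not stop!" line; panic_time is the timestamp of the
    first softdog panic line, or None if there is none.
    """
    events, panic_time = [], None
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # skip lines that do not look like dmesg output
        ts, msg = float(m.group(1)), m.group(2)
        if NOT_STOPPED in msg:
            events.append((ts, ts > LATE_AFTER_S))
        elif PANIC in msg and panic_time is None:
            panic_time = ts
    return events, panic_time


sample = [
    "<14>[  111.687451] [IGT] kms_cursor_legacy: exiting, ret=0",
    "<2>[  111.764992] watchdog: watchdog0: watchdog did not stop!",
    "<2>[  471.585302] softdog: Initiating panic",
]
events, panic_time = scan_dmesg(sample)
```

On the three lines quoted above this flags the message as late, and subtracting the two timestamps gives roughly 360s between the message and the softdog panic.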
Comment 26 Marta Löfstedt 2017-10-18 14:02:29 UTC
Here is another occurrence of "late" watchdog: watchdog0: watchdog did not stop!

<7>[  471.959641] [drm:intel_dp_hpd_pulse [i915]] ignoring long hpd on eDP port A
<2>[  471.997559] watchdog: watchdog0: watchdog did not stop!


https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3259/fi-cnl-y/igt@kms_pipe_crc_basic@read-crc-pipe-b-frame-sequence.html
Comment 27 Marta Löfstedt 2017-10-20 07:15:44 UTC
Here is another late one on GLK-shards

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3267/shard-glkb3/igt@syncobj_wait@multi-wait-all-for-submit-unsubmitted.html

dmesg starts at: <7>[  632.660857] [IGT] kms_addfb_basic: executing
i.e. this is not the first shard run on this machine

<7>[  982.048270] [IGT] gem_sync: exiting, ret=0
<2>[  982.166183] watchdog: watchdog0: watchdog did not stop!
<7>[  982.258442] [IGT] syncobj_wait: executing
<7>[  982.288218] [IGT] syncobj_wait: starting subtest multi-wait-all-for-submit-unsubmitted
<7>[  982.389696] [IGT] syncobj_wait: exiting, ret=0
Comment 29 Jani Saarinen 2017-10-25 09:09:16 UTC
Resolved as not an i915 bug. Let's still keep it in cibuglog.
New owatch deployed, so this possibly won't happen again.
Comment 30 Petri Latvala 2017-10-25 10:09:15 UTC
(In reply to Jani Saarinen from comment #29)
> Resolved as not an i915 bug. Let's still keep it in cibuglog.
> New owatch deployed, so this possibly won't happen again.

Owatch commit https://cgit.freedesktop.org/ezbench/commit/?h=owatch-ng&id=ca9a3fc38fdf8a00b75e89816dc9b37d3c4921d0 is what's deployed in CI now.
Comment 31 Marta Löfstedt 2017-10-26 10:28:48 UTC
NOTE: new owatch version from CI_DRM_3284
Comment 32 Marta Löfstedt 2017-10-30 07:30:07 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3291/shard-kbl5/igt@kms_plane_multiple@atomic-pipe-D-tiling-yf.html

<5>[   20.900148] owatch: Using watchdog device /dev/watchdog0
<5>[   20.900321] owatch: Watchdog /dev/watchdog0 is a software watchdog
<5>[   20.900891] owatch: timeout for /dev/watchdog0 set to 370 (requested 370)
<2>[   24.239245] watchdog: watchdog0: watchdog did not stop!
<6>[   24.261502] Console: switching to colour dummy device 80x25
Comment 33 Marta Löfstedt 2017-11-07 14:13:47 UTC
Late one on Patchwork:
https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6973/fi-cnl-y/igt@kms_force_connector_basic@force-connector-state.html

<2>[  423.183124] watchdog: watchdog0: watchdog did not stop!
Comment 34 Jani Saarinen 2017-11-28 08:20:57 UTC
Has this been seen after 2017-11-18?
Comment 35 Marta Löfstedt 2017-11-28 09:08:45 UTC
(In reply to Jani Saarinen from comment #34)
> Has this been seen after 2017-11-18?

No, but it has been seen with the exact same owatch as we are still running. So, this can't be closed until there is a clear analysis of why that occurrence shouldn't count.
Comment 36 Marta Löfstedt 2017-12-08 07:56:13 UTC
This hasn't been seen for a month. Since I close all other bugs on that criterion, I see no reason not to archive this one.
