On CI_DRM_2976 the sharded machine shard-hsw3 hard-hung while executing igt@gem_wait@write-wait-bsd2. This may be related to the following warning in the boot log:

[ 3.744474] [drm:sandybridge_pcode_read [i915]] warning: pcode (read from mbox 5) mailbox access failed for intel_enable_gt_powersave [i915]: -6

Full logs: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_2976/shard-hsw3/igt@gem_wait@write-wait-bsd2.html
Nah, that isn't as bad as: [ 25.103020] watchdog: watchdog0: watchdog did not stop!
*** Bug 102893 has been marked as a duplicate of this bug. ***
I believe this can happen to any test at any time. Very scary, since it could potentially kill all coverage on HSW.
Something opened the watchdog device and didn't close it properly, leaving the timebomb ticking. We wrap piglit in owatch, and theoretically this could be owatch dying, but that should also kill piglit, and there's IGT output in dmesg after the watchdog message.
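For context on why the kernel prints that message at all: "watchdog: watchdogN: watchdog did not stop!" is logged when the watchdog character device is closed without the magic-close handshake, i.e. writing 'V' immediately before close (and assuming the driver isn't built with nowayout, the device keeps ticking and eventually fires). A minimal sketch of the clean shutdown sequence, run against a plain file as a safe stand-in so it never arms real hardware:

```python
import os

def close_watchdog_cleanly(fd: int) -> None:
    """Write the magic-close character 'V', then close.

    On a real /dev/watchdogN this tells the kernel the close is
    deliberate; without it the driver logs
    "watchdog: watchdogN: watchdog did not stop!" and the timer
    keeps running -- the timebomb described above.
    """
    os.write(fd, b"V")
    os.close(fd)

# Demo against a regular file (hypothetical name) instead of the
# real device, so running this sketch can never trigger a reboot.
path = "watchdog-standin"
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
close_watchdog_cleanly(fd)
```

If owatch (or whatever opened the device) dies before reaching this sequence, the kernel message and the later panic are exactly what you would expect.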
FYI: the "[ 3.744474] [drm:sandybridge_pcode_read [i915]] warning: pcode (read from mbox 5) mailbox access failed for intel_enable_gt_powersave [i915]: -6" warning has nothing to do with this; that log line appears on every boot of the HSW shards.
Created attachment 134405 [details] list of suspicious runs

List of dmesgs from 21 runs where this issue has happened.
*** Bug 102656 has been marked as a duplicate of this bug. ***
*** Bug 102331 has been marked as a duplicate of this bug. ***
I went through the list and found 2 more bugs that duplicate this issue.
Changing to medium/major since this is sporadic in BAT.
One more: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3154/fi-skl-6260u/igt@chamelium@dp-hpd-fast.html
Here again: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3154/fi-kbl-7500u/igt@chamelium@dp-hpd-fast.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3167/shard-hsw5/igt@kms_cursor_legacy@pipe-B-torture-bo.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3171/shard-hsw5/igt@gem_mmap_gtt@basic.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3171/shard-hsw4/igt@kms_cursor_legacy@pipe-F-forked-move.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3210/shard-hsw5/igt@syncobj_wait@wait-all-for-submit-delayed-submit.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3216/shard-kbl2/igt@kms_frontbuffer_tracking@fbcpsr-2p-scndscrn-pri-indfb-draw-pwrite.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3222/shard-kbl7/igt@kms_cursor_legacy@pipe-F-forked-bo.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3222/shard-kbl1/igt@gem_persistent_relocs@forked-interruptible-thrashing.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3235/shard-hsw6/igt@gem_exec_schedule@preempt-other-bsd1.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3241/shard-kbl1/igt@kms_flip@blt-flip-vs-panning.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3242/shard-kbl3/igt@kms_cursor_legacy@2x-long-cursor-vs-flip-legacy.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3233/shard-kbl4/igt@kms_frontbuffer_tracking@psr-2p-primscrn-cur-indfb-draw-mmap-cpu.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3255/fi-hsw-4770r/igt@chamelium@dp-edid-read.html
Note: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3252/shard-glkb5/dmesg-1508248150_Panic_1.log

<14>[ 111.687451] [IGT] kms_cursor_legacy: exiting, ret=0
<2>[ 111.764992] watchdog: watchdog0: watchdog did not stop!
<2>[ 471.585302] softdog: Initiating panic

I have never seen this trigger as late as 111s; otherwise it is always at ~30s after boot. This is now even more scary, since we potentially need to grep every dmesg to find all occurrences of this issue.
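Since the message can now fire late, sweeping all archived dmesg files is the only way to catch every occurrence. A minimal sketch of such a scan, assuming the logs are stored as *.log files under some directory (the path and layout are hypothetical):

```python
from pathlib import Path

NEEDLE = "watchdog: watchdog0: watchdog did not stop!"

def find_watchdog_hits(log_dir: str) -> list:
    """Return the dmesg logs under log_dir that contain the
    tell-tale watchdog message, regardless of timestamp."""
    hits = []
    for log in sorted(Path(log_dir).rglob("*.log")):
        if NEEDLE in log.read_text(errors="replace"):
            hits.append(log)
    return hits
```

The equivalent one-liner would be `grep -rl 'watchdog did not stop!' <logdir>`; the point is to match on the message text rather than on the ~30s boot-time window we assumed until now.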
Here is another "late" occurrence of "watchdog: watchdog0: watchdog did not stop!":

<7>[ 471.959641] [drm:intel_dp_hpd_pulse [i915]] ignoring long hpd on eDP port A
<2>[ 471.997559] watchdog: watchdog0: watchdog did not stop!

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3259/fi-cnl-y/igt@kms_pipe_crc_basic@read-crc-pipe-b-frame-sequence.html
Here is another late one, on the GLK shards: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3267/shard-glkb3/igt@syncobj_wait@multi-wait-all-for-submit-unsubmitted.html

dmesg starts at:
<7>[ 632.660857] [IGT] kms_addfb_basic: executing
i.e. this is not the first shard run on this machine.

<7>[ 982.048270] [IGT] gem_sync: exiting, ret=0
<2>[ 982.166183] watchdog: watchdog0: watchdog did not stop!
<7>[ 982.258442] [IGT] syncobj_wait: executing
<7>[ 982.288218] [IGT] syncobj_wait: starting subtest multi-wait-all-for-submit-unsubmitted
<7>[ 982.389696] [IGT] syncobj_wait: exiting, ret=0
Here is a classic BAT one: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3271/fi-bdw-gvtdvm/igt@chamelium@dp-hpd-fast.html
Resolved as not an i915 bug. Let's still keep it in cibuglog. New owatch deployed; hopefully this won't happen again.
(In reply to Jani Saarinen from comment #29)
> Resolved as no i915 bug. Lets keep still in cibuglog.
> New owatch deployed, possibly not happen again.

Owatch commit https://cgit.freedesktop.org/ezbench/commit/?h=owatch-ng&id=ca9a3fc38fdf8a00b75e89816dc9b37d3c4921d0 is what's deployed in CI now.
NOTE: new owatch version from CI_DRM_3284
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3291/shard-kbl5/igt@kms_plane_multiple@atomic-pipe-D-tiling-yf.html

<5>[ 20.900148] owatch: Using watchdog device /dev/watchdog0
<5>[ 20.900321] owatch: Watchdog /dev/watchdog0 is a software watchdog
<5>[ 20.900891] owatch: timeout for /dev/watchdog0 set to 370 (requested 370)
<2>[ 24.239245] watchdog: watchdog0: watchdog did not stop!
<6>[ 24.261502] Console: switching to colour dummy device 80x25
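For reference, owatch's "timeout for /dev/watchdog0 set to 370 (requested 370)" line corresponds to the WDIOC_SETTIMEOUT ioctl on the watchdog device: the driver may clamp the requested value, so a careful client reads back what was actually set. A hedged sketch of that call (the ioctl number is built by hand assuming the common asm-generic encoding used on x86; nothing below touches real hardware):

```python
import fcntl
import struct

# Hand-encode WDIOC_SETTIMEOUT = _IOWR('W', 6, int) so the sketch is
# dependency-free; the resulting value matches <linux/watchdog.h> on
# architectures using the asm-generic ioctl layout (e.g. x86).
_IOC_WRITE, _IOC_READ = 1, 2

def _IOWR(type_chr: str, nr: int, size: int) -> int:
    return ((_IOC_READ | _IOC_WRITE) << 30) | (size << 16) | (ord(type_chr) << 8) | nr

WDIOC_SETTIMEOUT = _IOWR("W", 6, struct.calcsize("i"))

def set_watchdog_timeout(fd: int, seconds: int) -> int:
    """Request a timeout on an open /dev/watchdogN fd; the driver may
    clamp the value, so return what it reports back."""
    buf = struct.pack("i", seconds)
    buf = fcntl.ioctl(fd, WDIOC_SETTIMEOUT, buf)
    return struct.unpack("i", buf)[0]
```

That "set to 370 (requested 370)" wording is exactly the request/read-back pair, which is why the log prints both numbers.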
Late one on Patchwork: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6973/fi-cnl-y/igt@kms_force_connector_basic@force-connector-state.html

<2>[ 423.183124] watchdog: watchdog0: watchdog did not stop!
Has this been seen after 2017-11-18?
(In reply to Jani Saarinen from comment #34)
> Have this seen after 2017-11-18 ?

No, but it has been seen with the exact same owatch that we are still running. So this can't be closed until there is a clear analysis of why that occurrence shouldn't count.
This hasn't been seen for a month, and since I close all other bugs on that criterion, I see no reason not to archive this one.