On CI_DRM_2976 the sharded machine shard-hsw3 hard-hung while executing igt@gem_wait@write-wait-bsd2. This may be related to the following warning in the boot log:

[ 3.744474] [drm:sandybridge_pcode_read [i915]] warning: pcode (read from mbox 5) mailbox access failed for intel_enable_gt_powersave [i915]: -6

Full logs: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_2976/shard-hsw3/igt@gem_wait@write-wait-bsd2.html
Nah, that isn't as bad as: [ 25.103020] watchdog: watchdog0: watchdog did not stop!
*** Bug 102893 has been marked as a duplicate of this bug. ***
I believe this can happen to any test at any time. Very scary, since it could potentially kill all coverage on HSW.
Something opened the watchdog device and didn't close it properly, leaving the timebomb ticking. We wrap piglit in owatch, and theoretically this could be owatch dying, but that should also kill piglit, and there's IGT output in dmesg after the watchdog message.
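For context on why the kernel prints that message at all: "watchdog: watchdogN: watchdog did not stop!" is logged when the watchdog character device is closed without the magic-close handshake, i.e. writing 'V' immediately before close (and assuming the driver isn't built with nowayout, the device keeps ticking and eventually fires). A minimal sketch of the clean shutdown sequence, run against a plain file as a safe stand-in so it never arms real hardware:

```python
import os

def close_watchdog_cleanly(fd: int) -> None:
    """Write the magic-close character 'V', then close.

    On a real /dev/watchdogN this tells the kernel the close is
    deliberate; without it the driver logs
    "watchdog: watchdogN: watchdog did not stop!" and the timer
    keeps running -- the timebomb described above.
    """
    os.write(fd, b"V")
    os.close(fd)

# Demo against a regular file (hypothetical name) instead of the
# real device, so running this sketch can never trigger a reboot.
path = "watchdog-standin"
fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
close_watchdog_cleanly(fd)
```

If owatch (or whatever opened the device) dies before reaching this sequence, the kernel message and the later panic are exactly what you would expect.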
FYI: the "[ 3.744474] [drm:sandybridge_pcode_read [i915]] warning: pcode (read from mbox 5) mailbox access failed for intel_enable_gt_powersave [i915]: -6" warning has nothing to do with this; that log line appears on every boot of the HSW shards.
Created attachment 134405 [details] list of suspicious runs

List of dmesgs from 21 runs where this issue has happened.
*** Bug 102656 has been marked as a duplicate of this bug. ***
*** Bug 102331 has been marked as a duplicate of this bug. ***
I went through the list and found 2 more bugs that duplicate this issue.
Changing to medium/major since this is sporadic in BAT.
One more: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3154/fi-skl-6260u/igt@chamelium@dp-hpd-fast.html
Here again: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3154/fi-kbl-7500u/igt@chamelium@dp-hpd-fast.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3167/shard-hsw5/igt@kms_cursor_legacy@pipe-B-torture-bo.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3171/shard-hsw5/igt@gem_mmap_gtt@basic.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3171/shard-hsw4/igt@kms_cursor_legacy@pipe-F-forked-move.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3210/shard-hsw5/igt@syncobj_wait@wait-all-for-submit-delayed-submit.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3216/shard-kbl2/igt@kms_frontbuffer_tracking@fbcpsr-2p-scndscrn-pri-indfb-draw-pwrite.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3222/shard-kbl7/igt@kms_cursor_legacy@pipe-F-forked-bo.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3222/shard-kbl1/igt@gem_persistent_relocs@forked-interruptible-thrashing.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3235/shard-hsw6/igt@gem_exec_schedule@preempt-other-bsd1.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3241/shard-kbl1/igt@kms_flip@blt-flip-vs-panning.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3242/shard-kbl3/igt@kms_cursor_legacy@2x-long-cursor-vs-flip-legacy.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3233/shard-kbl4/igt@kms_frontbuffer_tracking@psr-2p-primscrn-cur-indfb-draw-mmap-cpu.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3255/fi-hsw-4770r/igt@chamelium@dp-edid-read.html
Note: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3252/shard-glkb5/dmesg-1508248150_Panic_1.log

<14>[ 111.687451] [IGT] kms_cursor_legacy: exiting, ret=0
<2>[ 111.764992] watchdog: watchdog0: watchdog did not stop!
<2>[ 471.585302] softdog: Initiating panic

I have never seen this trigger as late as 111s; otherwise it is always at ~30s after boot. This is now even more scary, since we potentially need to grep every dmesg to find all occurrences of this issue.
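Since the message can now fire late, sweeping all archived dmesg files is the only way to catch every occurrence. A minimal sketch of such a scan, assuming the logs are stored as *.log files under some directory (the path and layout are hypothetical):

```python
from pathlib import Path

NEEDLE = "watchdog: watchdog0: watchdog did not stop!"

def find_watchdog_hits(log_dir: str) -> list:
    """Return the dmesg logs under log_dir that contain the
    tell-tale watchdog message, regardless of timestamp."""
    hits = []
    for log in sorted(Path(log_dir).rglob("*.log")):
        if NEEDLE in log.read_text(errors="replace"):
            hits.append(log)
    return hits
```

The equivalent one-liner would be `grep -rl 'watchdog did not stop!' <logdir>`; the point is to match on the message text rather than on the ~30s boot-time window we assumed until now.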
Here is another "late" occurrence of "watchdog: watchdog0: watchdog did not stop!":

<7>[ 471.959641] [drm:intel_dp_hpd_pulse [i915]] ignoring long hpd on eDP port A
<2>[ 471.997559] watchdog: watchdog0: watchdog did not stop!

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3259/fi-cnl-y/igt@kms_pipe_crc_basic@read-crc-pipe-b-frame-sequence.html
Here is another late one, on the GLK shards: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3267/shard-glkb3/igt@syncobj_wait@multi-wait-all-for-submit-unsubmitted.html

dmesg starts at:
<7>[ 632.660857] [IGT] kms_addfb_basic: executing
i.e. this is not the first shard run on this machine.

<7>[ 982.048270] [IGT] gem_sync: exiting, ret=0
<2>[ 982.166183] watchdog: watchdog0: watchdog did not stop!
<7>[ 982.258442] [IGT] syncobj_wait: executing
<7>[ 982.288218] [IGT] syncobj_wait: starting subtest multi-wait-all-for-submit-unsubmitted
<7>[ 982.389696] [IGT] syncobj_wait: exiting, ret=0
Here is a classic BAT one: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3271/fi-bdw-gvtdvm/igt@chamelium@dp-hpd-fast.html
Resolved as not an i915 bug. Let's still keep it in cibuglog. New owatch deployed; hopefully this won't happen again.
(In reply to Jani Saarinen from comment #29)
> Resolved as no i915 bug. Lets keep still in cibuglog.
> New owatch deployed, possibly not happen again.

Owatch commit https://cgit.freedesktop.org/ezbench/commit/?h=owatch-ng&id=ca9a3fc38fdf8a00b75e89816dc9b37d3c4921d0 is what's deployed in CI now.
NOTE: new owatch version from CI_DRM_3284
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3291/shard-kbl5/igt@kms_plane_multiple@atomic-pipe-D-tiling-yf.html

<5>[ 20.900148] owatch: Using watchdog device /dev/watchdog0
<5>[ 20.900321] owatch: Watchdog /dev/watchdog0 is a software watchdog
<5>[ 20.900891] owatch: timeout for /dev/watchdog0 set to 370 (requested 370)
<2>[ 24.239245] watchdog: watchdog0: watchdog did not stop!
<6>[ 24.261502] Console: switching to colour dummy device 80x25
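For reference, owatch's "timeout for /dev/watchdog0 set to 370 (requested 370)" line corresponds to the WDIOC_SETTIMEOUT ioctl on the watchdog device: the driver may clamp the requested value, so a careful client reads back what was actually set. A hedged sketch of that call (the ioctl number is built by hand assuming the common asm-generic encoding used on x86; nothing below touches real hardware):

```python
import fcntl
import struct

# Hand-encode WDIOC_SETTIMEOUT = _IOWR('W', 6, int) so the sketch is
# dependency-free; the resulting value matches <linux/watchdog.h> on
# architectures using the asm-generic ioctl layout (e.g. x86).
_IOC_WRITE, _IOC_READ = 1, 2

def _IOWR(type_chr: str, nr: int, size: int) -> int:
    return ((_IOC_READ | _IOC_WRITE) << 30) | (size << 16) | (ord(type_chr) << 8) | nr

WDIOC_SETTIMEOUT = _IOWR("W", 6, struct.calcsize("i"))

def set_watchdog_timeout(fd: int, seconds: int) -> int:
    """Request a timeout on an open /dev/watchdogN fd; the driver may
    clamp the value, so return what it reports back."""
    buf = struct.pack("i", seconds)
    buf = fcntl.ioctl(fd, WDIOC_SETTIMEOUT, buf)
    return struct.unpack("i", buf)[0]
```

That "set to 370 (requested 370)" wording is exactly the request/read-back pair, which is why the log prints both numbers.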
Late one on Patchwork: https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_6973/fi-cnl-y/igt@kms_force_connector_basic@force-connector-state.html

<2>[ 423.183124] watchdog: watchdog0: watchdog did not stop!
Has this been seen after 2017-11-18?
(In reply to Jani Saarinen from comment #34)
> Have this seen after 2017-11-18 ?

No, but it has been seen with the exact same owatch that we are still running. So this can't be closed until there is a clear analysis of why that occurrence shouldn't count.
This hasn't been seen for a month, and since I close all other bugs on that criterion, I see no reason not to archive this one.