Summary: | [CI][SHARDS] Random tests - incomplete - empty stdout/stderr/dmesg - watchdog0: watchdog did not stop! | ||
---|---|---|---|
Product: | DRI | Reporter: | Lakshmi <lakshminarayana.vudum> |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | RESOLVED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | high | CC: | arkadiusz.hiler, intel-gfx-bugs, petri.latvala |
Version: | DRI git | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | ReadyForDev, Triaged | ||
i915 platform: | CFL, ICL | i915 features: | CI Infra |
Description
Lakshmi
2019-03-26 10:20:21 UTC
306.475594] watchdog: watchdog0: watchdog did not stop! The CI Bug Log issue associated to this bug has been updated. ### New filters associated * shard-iclb6: Random tests - incomplete - watchdog0: watchdog did not stop! - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_12512/shard-iclb6/igt@gem_mmap_gtt@basic-write-gtt.html - https://intel-gfx-ci.01.org/tree/drm-tip/Patchwork_12514/shard-iclb6/igt@gem_mmap_gtt@basic-write-gtt.html - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4903/shard-iclb6/igt@gem_mocs_settings@mocs-rc6-blt.html A CI Bug Log filter associated to this bug has been updated: {- shard-iclb6: Random tests - incomplete - watchdog0: watchdog did not stop! -} {+ shard-iclb6 shard-iclb8: Random tests - incomplete - watchdog0: watchdog did not stop! +} New failures caught by the filter: * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5829/shard-iclb8/igt@kms_psr@psr2_cursor_mmap_cpu.html A CI Bug Log filter associated to this bug has been updated: {- shard-iclb6 shard-iclb8: Random tests - incomplete - watchdog0: watchdog did not stop! -} {+ shard-iclb6 shard-iclb8: Random tests - incomplete - watchdog0: watchdog did not stop! +} New failures caught by the filter: * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5830/shard-iclb5/igt@i915_hangman@error-state-capture-bsd1.html A CI Bug Log filter associated to this bug has been updated: {- shard-iclb6 shard-iclb8: Random tests - incomplete - watchdog0: watchdog did not stop! -} {+ ICL: Random tests - incomplete - watchdog0: watchdog did not stop! +} New failures caught by the filter: * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5833/shard-iclb1/igt@i915_hangman@error-state-capture-bsd1.html * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5836/fi-icl-u3/igt@gem_cpu_reloc@basic.html A CI Bug Log filter associated to this bug has been updated: {- ICL: Random tests - incomplete - watchdog0: watchdog did not stop! -} {+ ICL: Random tests - incomplete - watchdog0: watchdog did not stop! +} No new failures caught with the new filter A CI Bug Log filter associated to this bug has been updated: {- ICL: Random tests - incomplete - watchdog0: watchdog did not stop! -} {+ ICL: Random tests - incomplete - watchdog0: watchdog did not stop! +} New failures caught by the filter: * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5838/shard-iclb6/igt@gem_exec_schedule@preempt-queue-bsd2.html * https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4913/shard-iclb1/igt@kms_frontbuffer_tracking@fbc-2p-primscrn-spr-indfb-draw-pwrite.html * https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4914/shard-iclb1/igt@gem_wait@write-busy-blt.html * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5845/fi-icl-guc/igt@gem_basic@create-fd-close.html A CI Bug Log filter associated to this bug has been updated: {- ICL: Random tests - incomplete - watchdog0: watchdog did not stop! -} {+ ICL: Random tests - incomplete - watchdog0: watchdog did not stop! +} No new failures caught with the new filter A CI Bug Log filter associated to this bug has been updated: {- ICL: Random tests - incomplete - watchdog0: watchdog did not stop! -} {+ ICL: Random tests - incomplete - watchdog0: watchdog did not stop! +} New failures caught by the filter: * https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4928/shard-iclb3/igt@gem_exec_store@cachelines-blt.html A CI Bug Log filter associated to this bug has been updated: {- ICL: Random tests - incomplete - watchdog0: watchdog did not stop! -} {+ ICL: Random tests - incomplete - watchdog0: watchdog did not stop! +} New failures caught by the filter: * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5879/fi-icl-u2/igt@gem_exec_basic@basic-blt.html A CI Bug Log filter associated to this bug has been updated: {- ICL: Random tests - incomplete - watchdog0: watchdog did not stop! -} {+ ICL: Random tests - incomplete - watchdog0: watchdog did not stop! +} New failures caught by the filter: * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_249/fi-icl-u3/igt@kms_flip@2x-single-buffer-flip-vs-dpms-off-vs-modeset.html A CI Bug Log filter associated to this bug has been updated: {- ICL: Random tests - incomplete - watchdog0: watchdog did not stop! -} {+ ICL: Random tests - incomplete - watchdog0: watchdog did not stop! +} New failures caught by the filter: * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_248/fi-icl-u3/igt@kms_cursor_legacy@flip-vs-cursor-crc-legacy.html A CI Bug Log filter associated to this bug has been updated: {- ICL: Random tests - incomplete - watchdog0: watchdog did not stop! -} {+ ICL: Random tests - incomplete - watchdog0: watchdog did not stop! +} New failures caught by the filter: * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_250/fi-icl-u3/igt@kms_cursor_legacy@cursor-vs-flip-atomic-transitions-varying-size.html * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_250/fi-icl-y/igt@kms_cursor_legacy@cursor-vs-flip-atomic-transitions-varying-size.html * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_250/fi-icl-y/igt@drm_read@short-buffer-block.html * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_250/fi-icl-y/igt@prime_nv_pcopy@test_semaphore.html A CI Bug Log filter associated to this bug has been updated: {- ICL: Random tests - incomplete - watchdog0: watchdog did not stop! -} {+ ICL: Random tests - incomplete - empty stdout/stderr/dmesg +} No new failures caught with the new filter A CI Bug Log filter associated to this bug has been updated: {- ICL: Random tests - incomplete - empty stdout/stderr/dmesg -} {+ ICL: Random tests - incomplete - empty stdout/stderr/dmesg +} New failures caught by the filter: * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5906/shard-iclb7/igt@kms_cursor_legacy@pipe-b-single-move.html A CI Bug Log filter associated to this bug has been updated: {- ICL: Random tests - incomplete - empty stdout/stderr/dmesg -} {+ ICL: Random tests - incomplete - empty stdout/stderr/dmesg +} New failures caught by the filter: * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5913/shard-iclb7/igt@kms_frontbuffer_tracking@fbcpsr-2p-scndscrn-pri-indfb-draw-mmap-wc.html * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5913/shard-iclb7/igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-cur-indfb-draw-render.html A CI Bug Log filter associated to this bug has been updated: {- ICL: Random tests - incomplete - empty stdout/stderr/dmesg -} {+ ICL: Random tests - incomplete - empty stdout/stderr/dmesg +} New failures caught by the filter: * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_252/fi-icl-u3/igt@prime_vgem@wait-vebox.html A CI Bug Log filter associated to this bug has been updated: {- ICL: Random tests - incomplete - empty stdout/stderr/dmesg -} {+ ICL: Random tests - incomplete - empty stdout/stderr/dmesg +} New failures caught by the filter: * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_252/fi-icl-y/igt@kms_flip@plain-flip-ts-check.html * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_252/fi-icl-y/igt@gem_create@stolen-invalid-flag.html * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_253/fi-icl-u3/igt@syncobj_basic@create-signaled.html * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_254/fi-icl-u2/igt@gem_mmap@basic-small-bo.html * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_254/fi-icl-u2/igt@gem_userptr_blits@stress-mm.html * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_254/fi-icl-y/igt@prime_busy@after-bsd2.html * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_254/fi-icl-y/igt@perf@invalid-oa-metric-set-id.html * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_255/fi-icl-u2/igt@gem_exec_fence@basic-await-default.html * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_255/fi-icl-y/igt@gem_pread@self.html This is a CI issue like how we run the tests. Random tests are failed as incompletes. The impact of this bug is that we don't test results from the incompletes. We can not know the user impact at the moment unless the bug is fixed, e.g we can't catch regressions. All the failures in this bugs are seen only on ICL. (In reply to Lakshmi from comment #19) > This is a CI issue like how we run the tests. Random tests are failed as > incompletes. > The impact of this bug is that we don't test results from the incompletes. > We can not know the user impact at the moment unless the bug is fixed, e.g > we can't catch regressions. > > All the failures in this bugs are seen only on ICL. What Lakshmi was trying to say is that we cannot know the health of ICL until this bug is fixed, because we cannot distinguish if an incomplete comes because of a platform issue or a CI system issue. We need to investigate why ICL is behaving so badly compared to the other shards or RVPs (or maybe we just try too hard compared to other platforms where we just ignore all incompletes), and experiment with potential fixes. Arek/Petri any comments on this bug, how do we proceed with this bug? Petri mentioned there has been some changes made on CI side so that omm killer do not kill java...So has this been seen after this change? A CI Bug Log filter associated to this bug has been updated: {- ICL: Random tests - incomplete - empty stdout/stderr/dmesg -} {+ ICL: Random tests - incomplete - empty stdout/stderr/dmesg - watchdog0: watchdog did not stop! +} No new failures caught with the new filter The CI Bug Log issue associated to this bug has been updated. ### New filters associated * CFL: Random tests - incomplete - Empty logs - watchdog: watchdog0: watchdog did not stop! - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_259/fi-cfl-8109u/igt@kms_cursor_edge_walk@pipe-b-64x64-right-edge.html (In reply to CI Bug Log from comment #24) > The CI Bug Log issue associated to this bug has been updated. > > ### New filters associated > > * CFL: Random tests - incomplete - Empty logs - watchdog: watchdog0: > watchdog did not stop! > - > https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_259/fi-cfl-8109u/ > igt@kms_cursor_edge_walk@pipe-b-64x64-right-edge.html Also seen on CFL. The CI Bug Log issue associated to this bug has been updated. ### New filters associated * SKL: Random tests - incomplete - empty stdout/stderr/dmesg - watchdog0: watchdog did not stop! - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_262/fi-skl-6600u/igt@kms_addfb_basic@framebuffer-vs-set-tiling.html (In reply to Jani Saarinen from comment #22) > Petri mentioned there has been some changes made on CI side so that omm > killer do not kill java...So has this been seen after this change? I believe this is the patch: commit a52cc643cfe6733465cfc9ccb3d21cbdc4fd7506 Author: Petri Latvala <petri.latvala@intel.com> Date: Thu Apr 11 12:40:12 2019 +0300 runner: Make sure oom-killer doesn't kill the runner Tests that eat all of the RAM and then some to invoke the oom-killer deliberately sometimes cause extra casualties. Make sure the runner stays alive. Signed-off-by: Petri Latvala <petri.latvala@intel.com> Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk> The issue still occurs: https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6005/fi-icl-u2/igt@gem_exec_basic@basic-blt.html https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6012/fi-icl-u2/igt@gem_exec_basic@basic-blt.html Reproduction rate seems to be still in the same ballpark. The CI Bug Log issue associated to this bug has been updated. ### New filters associated * fi-cml-u2: Random tests - incomplete - watchdog0: watchdog did not stop! - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6070/fi-cml-u2/igt@gem_cpu_reloc@basic.html A CI Bug Log filter associated to this bug has been updated: {- fi-cml-u2: Random tests - incomplete - watchdog0: watchdog did not stop! -} {+ CML: Random tests - incomplete - watchdog0: watchdog did not stop! +} New failures caught by the filter: * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_284/fi-cml-u/igt@gem_mmap_gtt@basic-read-no-prefault.html * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6081/fi-cml-u/igt@gem_cpu_reloc@basic.html * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_285/fi-cml-u/igt@kms_universal_plane@cursor-fb-leak-pipe-c.html A CI Bug Log filter associated to this bug has been updated: {- CML: Random tests - incomplete - watchdog0: watchdog did not stop! -} {+ CML: Random tests - incomplete - watchdog0: watchdog did not stop! +} New failures caught by the filter: * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6083/re-cml-u/igt@i915_pm_rpm@system-suspend-execbuf.html * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_286/fi-cml-u/igt@kms_vblank@pipe-a-accuracy-idle.html The CI Bug Log issue associated to this bug has been updated. ### New filters associated * KBL: igt@kms_color@pipe-b-ctm-0-25 - incomplete - watchdog0: watchdog did not stop! - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_303/fi-kbl-7560u/igt@kms_color@pipe-b-ctm-0-25.html But still seems to be present. Assessment stays true. <7>[ 26.796741] [IGT] kms_universal_plane: exiting, ret=0 <NOTHIGN FROM IGT> <2>[ 27.667078] watchdog: watchdog0: watchdog did not stop! That's 1s in which we do nothing, and then just die, which looks suspicious. Jenkin's side log end sooner, I guess this did not get enough time to sync over network. Few ideas that may help: 1. make ioctls/writes to watchdog more robust (retry on EAGAIN or EINTR) 2. log if we decrease watchdog timeout 3. log watchdog related messages to the dmesg too (this makes them recoverable from pstores/dmesg logs) or as a simpler (3) just 2>&1 | tee runner.log and publish those too There's now https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6264/fi-bdw-samus/igt_runner0.log which should contain full output of igt_runner, even when there were some network issues. Assuming it was correctly synced to the disk. It should be checked next time we have an occurance. After studying softdog (watchdog that is being used) it seems like there is not much that we can do in terms of ioctl. There are quick and quite simple, no EAGAINs, nothing, but we should still report errors if something happens. Other improvement would be stopping watchdog on SIGTERM and logging that runner is being terminated. I will send a patch. I might have an idea what actually happens: 1. Jenkins loses connection to the machine 2. because of that igt_runner gets SIGHUP or so 3. runner dies, so all fds are closed 4. all watchdogs are closed but without preceding magic "V" to stop them so we get the "watchdog: watchdog0: watchdog did not stop!" in dmesg We could get some signal handlers in place to be more explicit about runner dying. The CI Bug Log issue associated to this bug has been updated. ### New filters associated * BSW: Random tests - incomplete - watchdog0: watchdog did not stop! - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_308/fi-bsw-kefka/igt@kms_addfb_basic@bo-too-small.html The series mentioned above has been pushed. Hopefully it turns out that those incompletes were just unhandled SIGHUPs. Let's check in a week. Seems like handling SIGHUP might have done something. We haven't seen failures caught under this bug the whole last week. Previously there were seen on average 4 times a week. 2 more weeks like that and we can close it :-) The CI Bug Log issue associated to this bug has been updated. ### New filters associated * HSW: igt@perf_pmu@idle-vcs0 - timeout - watchdog: watchdog0: watchdog did not stop! - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6423/shard-hsw4/igt@perf_pmu@idle-vcs0.html (In reply to CI Bug Log from comment #40) > The CI Bug Log issue associated to this bug has been updated. > > ### New filters associated > > * HSW: igt@perf_pmu@idle-vcs0 - timeout - watchdog: watchdog0: watchdog did > not stop! > - > https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6423/shard-hsw4/ > igt@perf_pmu@idle-vcs0.html Ok this failure is quite something. 1. in stdout/stderr Test requirement not met in function igt_device_set_master, file ../lib/igt_device.c:61: Test requirement: __igt_device_set_master(fd) == 0 Can't become DRM master, please check if no other DRM client is running. command pid dev master a uid magic kms_cursor_crc 6870 0 y y 0 0 kms_cursor_crc 6870 0 n y 0 0 perf_pmu 6999 0 n y 0 0 perf_pmu 6999 0 n y 0 0 Received signal SIGQUIT. This means that we have some other IGT test alive that should not have been there. Runner should have killed them or failed earlier if they are not killable. I have to double-check child handling in igt_runner. 2. run.log told us that we got EBADF Failed to stop a watchdog: Bad file descriptor Failed to stop a watchdog: Bad file descriptor Timestamp 1562355594 /tmp/jenkins2496193922192201218.sh: line 118: 6978 Terminated stdbuf -o0 dmesg -rw > "$LOGDIR/$RUN/dmesg.log" 2>&1 (wd: /opt/igt) So the runner was terminating just fine and even tried to stop watchdog (now we can tell because logging added in comment #36) but it failed. I guess we need to start logging fds while initializing and then when closing to see what is going on. Anyway, seems like the system was in a completely broken state (hanging processes, defunct watchdogs). Let's see whether we see similar issues with other failures. This is a pretty wild ride so far. There are a couple of things that are happening here: 1. we may be closing watchdogs "twice", once by calling it explicitly on a regular exit code path, and the second time from atexit() handler since all those path are graceful(-ish), which is good because we close the watchdogs at the end of execution phase before results' intermediaries are parsed and compiled into .json 2. we should register signal handlers for SIGTERM, SIGHUP, SIGINT, SIGQUIT while setting up watchdogs, as there is small window of opportunity between each subtest execution that has not masked the signals. So if the signal is timed just right we won't close the dogs. execute_next_entry() is called for each "entry" (usually subtests) and does the masking and unmasking. 3. we should increase runner's verbosity level in CI to get more data on watchdog handling working on patches to address some of the points here Since merging of the series adding support for SIGHUP this bug seems to be rarer, but we need few more weeks before being sure. Hopefully with the signal handling reworked to always encompass watchdogs being active we will see this bug gone, but it still needs review. The series[0] overhauling signal handling has landed. Let's get back to this bug in 4 weeks, we should be able to get enough data to say if it had any impact. [0]: https://patchwork.freedesktop.org/series/63439/ The bug was seen about 8 times a week, then with the first batch of fixes the rate dropped a bit and now we haven't seen it in almost a month. Since 28 days > 10x longer than (once a day) I am closing this bug. The CI Bug Log issue associated to this bug has been archived. New failures matching the above filters will not be associated to this bug anymore. |
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.