Bug 110246 - [CI][SHARDS] Random tests - incomplete - empty stdout/stderr/dmesg - watchdog0: watchdog did not stop!
Summary: [CI][SHARDS] Random tests - incomplete - empty stdout/stderr/dmesg - watchdog...
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel (show other bugs)
Version: DRI git
Hardware: Other All
: high normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev, Triaged
Keywords:
Depends on:
Blocks:
 
Reported: 2019-03-26 10:20 UTC by Lakshmi
Modified: 2019-08-19 11:05 UTC (History)
3 users (show)

See Also:
i915 platform: CFL, ICL
i915 features: CI Infra


Attachments

Comment 1 Lakshmi 2019-03-26 10:21:01 UTC
306.475594] watchdog: watchdog0: watchdog did not stop!
Comment 3 CI Bug Log 2019-03-29 08:31:43 UTC
A CI Bug Log filter associated to this bug has been updated:

{- shard-iclb6: Random tests - incomplete - watchdog0: watchdog did not stop! -}
{+ shard-iclb6 shard-iclb8: Random tests - incomplete - watchdog0: watchdog did not stop! +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5829/shard-iclb8/igt@kms_psr@psr2_cursor_mmap_cpu.html
Comment 4 CI Bug Log 2019-03-29 08:34:59 UTC
A CI Bug Log filter associated to this bug has been updated:

{- shard-iclb6 shard-iclb8: Random tests - incomplete - watchdog0: watchdog did not stop! -}
{+ shard-iclb6 shard-iclb8: Random tests - incomplete - watchdog0: watchdog did not stop! +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5830/shard-iclb5/igt@i915_hangman@error-state-capture-bsd1.html
Comment 5 CI Bug Log 2019-03-29 14:54:23 UTC
A CI Bug Log filter associated to this bug has been updated:

{- shard-iclb6 shard-iclb8: Random tests - incomplete - watchdog0: watchdog did not stop! -}
{+ ICL: Random tests - incomplete - watchdog0: watchdog did not stop! +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5833/shard-iclb1/igt@i915_hangman@error-state-capture-bsd1.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5836/fi-icl-u3/igt@gem_cpu_reloc@basic.html
Comment 6 CI Bug Log 2019-03-29 14:54:38 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: Random tests - incomplete - watchdog0: watchdog did not stop! -}
{+ ICL: Random tests - incomplete - watchdog0: watchdog did not stop! +}

 No new failures caught with the new filter
Comment 8 CI Bug Log 2019-04-03 06:39:48 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: Random tests - incomplete - watchdog0: watchdog did not stop! -}
{+ ICL: Random tests - incomplete - watchdog0: watchdog did not stop! +}

 No new failures caught with the new filter
Comment 9 CI Bug Log 2019-04-05 05:28:54 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: Random tests - incomplete - watchdog0: watchdog did not stop! -}
{+ ICL: Random tests - incomplete - watchdog0: watchdog did not stop! +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4928/shard-iclb3/igt@gem_exec_store@cachelines-blt.html
Comment 10 CI Bug Log 2019-04-05 19:48:40 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: Random tests - incomplete - watchdog0: watchdog did not stop! -}
{+ ICL: Random tests - incomplete - watchdog0: watchdog did not stop! +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5879/fi-icl-u2/igt@gem_exec_basic@basic-blt.html
Comment 11 CI Bug Log 2019-04-09 07:17:49 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: Random tests - incomplete - watchdog0: watchdog did not stop! -}
{+ ICL: Random tests - incomplete - watchdog0: watchdog did not stop! +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_249/fi-icl-u3/igt@kms_flip@2x-single-buffer-flip-vs-dpms-off-vs-modeset.html
Comment 12 CI Bug Log 2019-04-09 09:34:50 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: Random tests - incomplete - watchdog0: watchdog did not stop! -}
{+ ICL: Random tests - incomplete - watchdog0: watchdog did not stop! +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_248/fi-icl-u3/igt@kms_cursor_legacy@flip-vs-cursor-crc-legacy.html
Comment 14 CI Bug Log 2019-04-09 14:11:41 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: Random tests - incomplete - watchdog0: watchdog did not stop! -}
{+ ICL: Random tests - incomplete - empty stdout/stderr/dmesg +}

 No new failures caught with the new filter
Comment 15 CI Bug Log 2019-04-15 06:34:20 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: Random tests - incomplete - empty stdout/stderr/dmesg -}
{+ ICL: Random tests - incomplete - empty stdout/stderr/dmesg +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5906/shard-iclb7/igt@kms_cursor_legacy@pipe-b-single-move.html
Comment 16 CI Bug Log 2019-04-15 06:36:48 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: Random tests - incomplete - empty stdout/stderr/dmesg -}
{+ ICL: Random tests - incomplete - empty stdout/stderr/dmesg +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5913/shard-iclb7/igt@kms_frontbuffer_tracking@fbcpsr-2p-scndscrn-pri-indfb-draw-mmap-wc.html

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5913/shard-iclb7/igt@kms_frontbuffer_tracking@fbcpsr-2p-primscrn-cur-indfb-draw-render.html
Comment 17 CI Bug Log 2019-04-16 06:03:15 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: Random tests - incomplete - empty stdout/stderr/dmesg -}
{+ ICL: Random tests - incomplete - empty stdout/stderr/dmesg +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_252/fi-icl-u3/igt@prime_vgem@wait-vebox.html
Comment 19 Lakshmi 2019-04-17 13:45:19 UTC
This is a CI issue like how we run the tests. Random tests are failed as incompletes.
The impact of this bug is that we don't test results from the incompletes. We can not know the user impact at the moment unless the bug is fixed, e.g we can't catch regressions.

All the failures in this bugs are seen only on ICL.
Comment 20 Martin Peres 2019-04-17 13:52:54 UTC
(In reply to Lakshmi from comment #19)
> This is a CI issue like how we run the tests. Random tests are failed as
> incompletes.
> The impact of this bug is that we don't test results from the incompletes.
> We can not know the user impact at the moment unless the bug is fixed, e.g
> we can't catch regressions.
> 
> All the failures in this bugs are seen only on ICL.

What Lakshmi was trying to say is that we cannot know the health of ICL until this bug is fixed, because we cannot distinguish if an incomplete comes because of a platform issue or a CI system issue.

We need to investigate why ICL is behaving so badly compared to the other shards or RVPs (or maybe we just try too hard compared to other platforms where we just ignore all incompletes), and experiment with potential fixes.
Comment 21 Lakshmi 2019-04-20 07:40:16 UTC
Arek/Petri any comments on this bug, how do we proceed with this bug?
Comment 22 Jani Saarinen 2019-04-22 08:41:41 UTC
Petri mentioned there has been some changes made on CI side so that omm killer do not kill java...So has this been seen after this change?
Comment 23 CI Bug Log 2019-04-26 07:14:55 UTC
A CI Bug Log filter associated to this bug has been updated:

{- ICL: Random tests - incomplete - empty stdout/stderr/dmesg -}
{+ ICL: Random tests - incomplete - empty stdout/stderr/dmesg - watchdog0: watchdog did not stop! +}


  No new failures caught with the new filter
Comment 24 CI Bug Log 2019-04-26 07:18:29 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* CFL: Random tests - incomplete - Empty logs - watchdog: watchdog0: watchdog did not stop!
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_259/fi-cfl-8109u/igt@kms_cursor_edge_walk@pipe-b-64x64-right-edge.html
Comment 25 Lakshmi 2019-04-26 07:25:34 UTC
(In reply to CI Bug Log from comment #24)
> The CI Bug Log issue associated to this bug has been updated.
> 
> ### New filters associated
> 
> * CFL: Random tests - incomplete - Empty logs - watchdog: watchdog0:
> watchdog did not stop!
>   -
> https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_259/fi-cfl-8109u/
> igt@kms_cursor_edge_walk@pipe-b-64x64-right-edge.html

Also seen on CFL.
Comment 26 CI Bug Log 2019-04-27 18:07:02 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* SKL: Random tests - incomplete - empty stdout/stderr/dmesg - watchdog0: watchdog did not stop!
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_262/fi-skl-6600u/igt@kms_addfb_basic@framebuffer-vs-set-tiling.html
Comment 27 Arek Hiler 2019-04-30 06:26:01 UTC
(In reply to Jani Saarinen from comment #22)
> Petri mentioned there has been some changes made on CI side so that omm
> killer do not kill java...So has this been seen after this change?

I believe this is the patch:

commit a52cc643cfe6733465cfc9ccb3d21cbdc4fd7506
Author: Petri Latvala <petri.latvala@intel.com>
Date:   Thu Apr 11 12:40:12 2019 +0300

    runner: Make sure oom-killer doesn't kill the runner

    Tests that eat all of the RAM and then some to invoke the oom-killer
    deliberately sometimes cause extra casualties. Make sure the runner
    stays alive.

    Signed-off-by: Petri Latvala <petri.latvala@intel.com>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>

The issue still occurs:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6005/fi-icl-u2/igt@gem_exec_basic@basic-blt.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6012/fi-icl-u2/igt@gem_exec_basic@basic-blt.html

Reproduction rate seems to be still in the same ballpark.
Comment 28 CI Bug Log 2019-05-13 06:21:22 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* fi-cml-u2: Random tests - incomplete -  watchdog0: watchdog did not stop!
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6070/fi-cml-u2/igt@gem_cpu_reloc@basic.html
Comment 29 CI Bug Log 2019-05-15 05:52:52 UTC
A CI Bug Log filter associated to this bug has been updated:

{- fi-cml-u2: Random tests - incomplete -  watchdog0: watchdog did not stop! -}
{+ CML: Random tests - incomplete -  watchdog0: watchdog did not stop! +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_284/fi-cml-u/igt@gem_mmap_gtt@basic-read-no-prefault.html
  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6081/fi-cml-u/igt@gem_cpu_reloc@basic.html
  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_285/fi-cml-u/igt@kms_universal_plane@cursor-fb-leak-pipe-c.html
Comment 30 CI Bug Log 2019-05-16 08:06:50 UTC
A CI Bug Log filter associated to this bug has been updated:

{- CML: Random tests - incomplete -  watchdog0: watchdog did not stop! -}
{+ CML: Random tests - incomplete -  watchdog0: watchdog did not stop! +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6083/re-cml-u/igt@i915_pm_rpm@system-suspend-execbuf.html
  * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_286/fi-cml-u/igt@kms_vblank@pipe-a-accuracy-idle.html
Comment 31 CI Bug Log 2019-06-11 10:29:28 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* KBL: igt@kms_color@pipe-b-ctm-0-25 - incomplete - watchdog0: watchdog did not stop!
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_303/fi-kbl-7560u/igt@kms_color@pipe-b-ctm-0-25.html
Comment 32 Arek Hiler 2019-06-13 06:33:13 UTC
But still seems to be present. Assessment stays true.

<7>[   26.796741] [IGT] kms_universal_plane: exiting, ret=0
<NOTHIGN FROM IGT>
<2>[   27.667078] watchdog: watchdog0: watchdog did not stop!

That's 1s in which we do nothing, and then just die, which looks suspicious. Jenkin's side log end sooner, I guess this did not get enough time to sync over network.

Few ideas that may help:
1. make ioctls/writes to watchdog more robust (retry on EAGAIN or EINTR)
2. log if we decrease watchdog timeout
3. log watchdog related messages to the dmesg too (this makes them recoverable from pstores/dmesg logs)
Comment 33 Arek Hiler 2019-06-13 06:39:15 UTC
or as a simpler (3) just 2>&1 | tee runner.log and publish those too
Comment 34 Arek Hiler 2019-06-14 06:14:52 UTC
There's now https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6264/fi-bdw-samus/igt_runner0.log which should contain full output of igt_runner, even when there were some network issues. Assuming it was correctly synced to the disk.

It should be checked next time we have an occurance.

After studying softdog (watchdog that is being used) it seems like there is not much that we can do in terms of ioctl. There are quick and quite simple, no EAGAINs, nothing, but we should still report errors if something happens.

Other improvement would be stopping watchdog on SIGTERM and logging that runner is being terminated.

I will send a patch.
Comment 35 Arek Hiler 2019-06-14 07:01:43 UTC
I might have an idea what actually happens:

1. Jenkins loses connection to the machine
2. because of that igt_runner gets SIGHUP or so
3. runner dies, so all fds are closed
4. all watchdogs are closed but without preceding magic "V" to stop them
   so we get the "watchdog: watchdog0: watchdog did not stop!" in dmesg

We could get some signal handlers in place to be more explicit about runner dying.
Comment 36 Arek Hiler 2019-06-14 08:32:20 UTC
https://patchwork.freedesktop.org/series/62093/
Comment 37 CI Bug Log 2019-06-18 05:53:55 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* BSW: Random tests - incomplete - watchdog0: watchdog did not stop!
  - https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_308/fi-bsw-kefka/igt@kms_addfb_basic@bo-too-small.html
Comment 38 Arek Hiler 2019-06-24 10:23:08 UTC
The series mentioned above has been pushed. Hopefully it turns out that those
incompletes were just unhandled SIGHUPs.

Let's check in a week.
Comment 39 Arek Hiler 2019-07-01 06:36:16 UTC
Seems like handling SIGHUP might have done something.

We haven't seen failures caught under this bug the whole last week. Previously there were seen on average 4 times a week. 2 more weeks like that and we can close it :-)
Comment 40 CI Bug Log 2019-07-08 07:17:41 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* HSW: igt@perf_pmu@idle-vcs0 - timeout - watchdog: watchdog0: watchdog did not stop!
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6423/shard-hsw4/igt@perf_pmu@idle-vcs0.html
Comment 41 Arek Hiler 2019-07-08 12:04:13 UTC
(In reply to CI Bug Log from comment #40)
> The CI Bug Log issue associated to this bug has been updated.
> 
> ### New filters associated
> 
> * HSW: igt@perf_pmu@idle-vcs0 - timeout - watchdog: watchdog0: watchdog did
> not stop!
>   -
> https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6423/shard-hsw4/
> igt@perf_pmu@idle-vcs0.html

Ok this failure is quite something. 

1. in stdout/stderr 
	
Test requirement not met in function igt_device_set_master, file ../lib/igt_device.c:61:
Test requirement: __igt_device_set_master(fd) == 0
Can't become DRM master, please check if no other DRM client is running.
             command   pid dev master a   uid      magic
      kms_cursor_crc  6870   0   y    y     0          0
      kms_cursor_crc  6870   0   n    y     0          0
            perf_pmu  6999   0   n    y     0          0
            perf_pmu  6999   0   n    y     0          0
Received signal SIGQUIT.


This means that we have some other IGT test alive that should not have been there. Runner should have killed them or failed earlier if they are not killable. I have to double-check child handling in igt_runner.

2. run.log told us that we got EBADF

Failed to stop a watchdog: Bad file descriptor
Failed to stop a watchdog: Bad file descriptor
Timestamp 1562355594
/tmp/jenkins2496193922192201218.sh: line 118:  6978 Terminated              stdbuf -o0 dmesg -rw > "$LOGDIR/$RUN/dmesg.log" 2>&1  (wd: /opt/igt)

So the runner was terminating just fine and even tried to stop watchdog (now we can tell because logging added in comment #36) but it failed. I guess we need to start logging fds while initializing and then when closing to see what is going on.

Anyway, seems like the system was in a completely broken state (hanging processes, defunct watchdogs). Let's see whether we see similar issues with other failures.
Comment 42 Arek Hiler 2019-07-09 07:40:22 UTC
This is a pretty wild ride so far. There are a couple of things that are happening here:

1. we may be closing watchdogs "twice", once by calling it explicitly on a regular exit code path, and the second time from atexit() handler

since all those path are graceful(-ish), which is good because we close the watchdogs at the end of execution phase before results' intermediaries are parsed and compiled into .json

2. we should register signal handlers for SIGTERM, SIGHUP, SIGINT, SIGQUIT while setting up watchdogs, as there is small window of opportunity between each subtest execution that has not masked the signals. So if the signal is timed just right we won't close the dogs.

execute_next_entry() is called for each "entry" (usually subtests) and does the masking and unmasking.

3. we should increase runner's verbosity level in CI to get more data on watchdog handling

working on patches to address some of the points here
Comment 43 Arek Hiler 2019-07-10 09:27:46 UTC
https://patchwork.freedesktop.org/series/63439/
Comment 44 Arek Hiler 2019-07-17 14:36:00 UTC
Since merging of the series adding support for SIGHUP this bug seems to be rarer, but we need few more weeks before being sure.

Hopefully with the signal handling reworked to always encompass watchdogs being active we will see this bug gone, but it still needs review.
Comment 45 Arek Hiler 2019-07-22 13:36:14 UTC
The series[0] overhauling signal handling has landed. Let's get back to this bug in 4 weeks, we should be able to get enough data to say if it had any impact.

[0]: https://patchwork.freedesktop.org/series/63439/
Comment 46 Arek Hiler 2019-08-19 09:45:35 UTC
The bug was seen about 8 times a week, then with the first batch of fixes the rate dropped a bit and now we haven't seen it in almost a month.

Since 28 days > 10x longer than (once a day) I am closing this bug.
Comment 47 CI Bug Log 2019-08-19 11:05:24 UTC
The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.


Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.