106127 – [CI] [Intel-GFX-CI] shards may get killed early, creating more incompletes

Summary: [CI] [Intel-GFX-CI] shards may get killed early, creating more incompletes

Status:	CLOSED FIXED

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	XOrg git
Hardware:	Other All

Importance:	high enhancement
Assignee:	Petri Latvala
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	ReadyForDev
Keywords:

Duplicates (1):	108220
Depends on:
Blocks:

Reported:	2018-04-18 14:35 UTC by Martin Peres
Modified:	2019-02-19 08:31 UTC
CC List:	2 users

See Also:	105617 105597 103713
i915 platform:	ALL
i915 features:	CI Infra

Description Martin Peres 2018-04-18 14:35:32 UTC

The expected execution time of the tests is not taken into account when creating shard lists, so the total run time of a shard may exceed the shard's execution timeout (see [1, 2]).

Let's discuss here how to handle this case as it is potentially killing any sharded run, reducing our coverage!

[1] https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_22/fi-skl-6770hq/igt@kms_frontbuffer_tracking@fbc-1p-rte.html
[2] https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_22/fi-skl-6770hq/run29.log

Comment 1 Marta Löfstedt 2018-04-19 06:31:49 UTC

See: 
bug 105617
bug 105597
bug 103713

Comment 2 Tomi Sarvela 2018-04-19 13:23:41 UTC

There might be something behind this. Runtime estimation data was collected from PASSED test runtimes of CI_DRM_4000 - 4066:

IGT_4432/shards$ shuf-check.py *
File x0000 estimated 707.57 unknown 0
File x0001 estimated 494.89 unknown 0
File x0002 estimated 669.97 unknown 0
File x0003 estimated 593.20 unknown 0
File x0004 estimated 590.12 unknown 0
File x0005 estimated 731.17 unknown 0
File x0006 estimated 392.13 unknown 0
File x0007 estimated 688.72 unknown 0
File x0008 estimated 356.21 unknown 0
File x0009 estimated 471.19 unknown 0
File x0010 estimated 492.80 unknown 0
File x0011 estimated 286.99 unknown 0
File x0012 estimated 568.27 unknown 0
File x0013 estimated 721.03 unknown 0
File x0014 estimated 686.85 unknown 0
File x0015 estimated 622.08 unknown 0
File x0016 estimated 641.90 unknown 0
File x0017 estimated 493.45 unknown 0
File x0018 estimated 641.31 unknown 0
File x0019 estimated 812.69 unknown 0
File x0020 estimated 742.41 unknown 0
File x0021 estimated 635.08 unknown 0
File x0022 estimated 565.93 unknown 0
File x0023 estimated 883.95 unknown 0
File x0024 estimated 586.08 unknown 0
File x0025 estimated 740.21 unknown 0
File x0026 estimated 530.69 unknown 0
File x0027 estimated 416.87 unknown 0
File x0028 estimated 371.33 unknown 0
File x0029 estimated 1318.44 unknown 0
File x0030 estimated 666.33 unknown 0
File x0031 estimated 428.05 unknown 0
File x0032 estimated 583.25 unknown 0
File x0033 estimated 434.65 unknown 0
File x0034 estimated 420.10 unknown 0
Shortest x0011 286.99
Longest x0029 1318.44

A new method of splitting the tests between shards is now in use. The runtimes of individual shards still vary widely (probably not easy to balance with randomization), but as a whole they stay very stable from shuffle to shuffle. The first runs will be IGTPW_1281 and the (not yet existing) IGT_4442.
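
The per-shard listing above can be summarised with a few lines of Python. This is only a sketch: it is not the shuf-check.py script itself (whose source is not shown in this bug), the sample is a five-line excerpt of the output quoted above, and HYPOTHETICAL_BUDGET is a made-up threshold in the same unit as the estimates, not the real shard timeout.

```python
import re

# Excerpt of the shuf-check.py output quoted above (5 of the 35 shards).
SAMPLE = """\
File x0000 estimated 707.57 unknown 0
File x0011 estimated 286.99 unknown 0
File x0023 estimated 883.95 unknown 0
File x0029 estimated 1318.44 unknown 0
File x0034 estimated 420.10 unknown 0
"""

LINE_RE = re.compile(r"^File (\S+) estimated ([0-9.]+) unknown (\d+)$")


def parse_estimates(text):
    """Return {shard_name: estimated_runtime} from shuf-check-style lines."""
    estimates = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            estimates[match.group(1)] = float(match.group(2))
    return estimates


def over_budget(estimates, budget):
    """List shards whose estimate exceeds the budget, longest first."""
    names = [name for name, value in estimates.items() if value > budget]
    return sorted(names, key=lambda name: estimates[name], reverse=True)


estimates = parse_estimates(SAMPLE)
shortest = min(estimates, key=estimates.get)
longest = max(estimates, key=estimates.get)

# Assumed value for illustration only; the real shard timeout is not
# stated in this bug.
HYPOTHETICAL_BUDGET = 900.0

print("Shortest", shortest, estimates[shortest])
print("Longest", longest, estimates[longest])
print("Over budget:", over_budget(estimates, HYPOTHETICAL_BUDGET))
```

Flagging shards against a budget before a run starts would point at lists such as x0029 that are likely to hit the shard timeout.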

Comment 3 Tomi Sarvela 2018-05-07 13:35:17 UTC

Shard shuffling improved to take estimated test durations into account.
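
One way to take estimated durations into account while keeping the shuffle is a greedy "longest test first, least-loaded shard next" split. The sketch below illustrates that general technique only; it is not the code used by the CI, and the function names, test names and durations are all hypothetical.

```python
import heapq
import random


def estimated(durations, test, default=0.0):
    """Estimated runtime of a test; `default` stands in for unknown tests."""
    value = durations[test]
    return default if value is None else value


def shard_load(shard, durations, default=0.0):
    """Sum of the estimated runtimes of the tests in one shard."""
    return sum(estimated(durations, test, default) for test in shard)


def balance_shards(durations, shard_count, seed=None, default=0.0):
    """Split tests into shard_count lists with similar estimated runtime."""
    rng = random.Random(seed)
    tests = list(durations)
    rng.shuffle(tests)  # random order among tests with equal estimates
    # Stable sort, longest first: placing big tests early evens out shards.
    tests.sort(key=lambda t: estimated(durations, t, default), reverse=True)

    heap = [(0.0, index) for index in range(shard_count)]  # (load, shard)
    heapq.heapify(heap)
    shards = [[] for _ in range(shard_count)]
    for test in tests:
        load, index = heapq.heappop(heap)  # least-loaded shard so far
        shards[index].append(test)
        heapq.heappush(heap, (load + estimated(durations, test, default), index))

    for shard in shards:
        rng.shuffle(shard)  # keep execution order inside a shard random
    return shards


if __name__ == "__main__":
    # Made-up test names and durations; None marks an unknown runtime.
    demo = {"test-a": 300, "test-b": 200, "test-c": 200,
            "test-d": 100, "test-e": None}
    for number, shard in enumerate(balance_shards(demo, 2, seed=4)):
        print(f"x{number:04d}", shard_load(shard, demo), shard)
```

A greedy split like this evens out the estimated load, but it cannot protect against every test becoming slower at once, because the estimates come from earlier runs.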

Comment 4 Martin Peres 2018-05-08 09:34:08 UTC

Not too sure we can close it. The problem may still occur if all the tests become slower in one revision.

I think the proper fix will be the new runner, when it learns to stop running new tests after a certain time threshold. I'll re-assign the bug to Petri for his input.

Comment 5 Martin Peres 2018-06-15 08:58:10 UTC

Petri, could you add a feature that stops running new tests if we are too close to the deadline? This should mitigate this issue.

Tomi, how comfortable would you be with bumping the shard timeout if you had the guarantee that tests are still being executed?

Comment 6 Petri Latvala 2018-10-31 12:07:12 UTC

commit 78619fde4008424c472906041edb1d204e014f7c
Author: Petri Latvala <petri.latvala@intel.com>
Date:   Wed Oct 10 13:41:00 2018 +0300

    runner: Add --overall-timeout
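
The behaviour requested in comment 5 (stop starting new tests once the time budget is used up) can be sketched in Python as below. This is an illustration of the idea only, not the runner's implementation from the commit above; run_with_overall_timeout, FakeClock and the test names are hypothetical.

```python
import time


def run_with_overall_timeout(tests, run_one, overall_timeout,
                             clock=time.monotonic):
    """Run tests in order; stop starting new ones once the budget is spent.

    Returns (executed, not_run). A test that has already started is never
    interrupted here, so the last test may finish after the budget.
    """
    tests = list(tests)
    start = clock()
    executed = []
    for position, test in enumerate(tests):
        if clock() - start >= overall_timeout:
            # Budget used up: report the rest as not run instead of
            # letting an outer timeout kill the whole job.
            return executed, tests[position:]
        run_one(test)
        executed.append(test)
    return executed, []


class FakeClock:
    """Deterministic clock so the example does not need to sleep."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


fake = FakeClock()


def fake_run(test):
    fake.now += 40.0  # pretend every test takes 40 time units


executed, not_run = run_with_overall_timeout(
    ["t1", "t2", "t3", "t4"], fake_run, overall_timeout=100.0, clock=fake)
print("executed:", executed)
print("not run: ", not_run)
```

Because a running test is not interrupted, the budget has to sit below the hard shard timeout by at least the longest expected single test for the job to exit cleanly.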

Comment 7 Martin Peres 2018-11-01 14:52:10 UTC

(In reply to Martin Peres from comment #5)
> Tomi, how comfortable would you be with bumping the shard timeout if you had
> the guarantee that tests are still being executed?

Tomi?

Comment 8 Martin Peres 2018-11-01 14:52:34 UTC

Re-opening since the feature is still not in use.

Comment 9 Tomi Sarvela 2018-11-02 09:29:00 UTC

What kind of time limits are we talking about? Timeout affects all runs.

I'd be more comfortable to exit igt early if we're close to the timeout. This would just get a note that not everything was tested due to limit, and no false positives from killing the job.

Comment 10 Martin Peres 2018-11-02 09:43:55 UTC

(In reply to Tomi Sarvela from comment #9)
> What kind of time limits are we talking about? Timeout affects all runs.
> 
> I'd be more comfortable to exit igt early if we're close to the timeout.
> This would just get a note that not everything was tested due to limit, and
> no false positives from killing the job.

That's indeed the idea! We would need to trust the IGT runner (and maybe write support for an external watchdog), but that would definitely limit the amount of random noise, which makes incomplete reports hard to act upon automatically.
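
A minimal sketch of the external watchdog idea, assuming the runner is started as a child process and that the hard limit is set above the runner's own overall timeout. The function name and the commands are hypothetical; the real CI setup is not described in this bug.

```python
import subprocess
import sys


def run_under_watchdog(command, hard_timeout):
    """Run command and kill it if it is still alive after hard_timeout seconds.

    Returns ("completed", returncode) when the command exits by itself and
    ("killed", None) when the watchdog had to step in.
    """
    try:
        # subprocess.run kills the child and waits for it on timeout,
        # then raises TimeoutExpired.
        result = subprocess.run(command, timeout=hard_timeout, check=False)
    except subprocess.TimeoutExpired:
        return ("killed", None)
    return ("completed", result.returncode)


if __name__ == "__main__":
    # Stand-in commands; a real setup would start the test runner here.
    quick = [sys.executable, "-c", "pass"]
    stuck = [sys.executable, "-c", "import time; time.sleep(30)"]
    print(run_under_watchdog(quick, hard_timeout=30))
    print(run_under_watchdog(stuck, hard_timeout=0.5))
```

Keeping "killed" separate from "completed" lets a kill by the watchdog be reported as an infrastructure event rather than as an incomplete test.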

Comment 11 Petri Latvala 2018-11-12 11:41:30 UTC

(In reply to Martin Peres from comment #8)
> Re-opening since the feature is still not in use.

Moving to CI Infra.

Comment 12 Tomi Sarvela 2018-11-21 14:11:29 UTC

Added --overall-timeout to Farm 1.

Comment 13 Jani Saarinen 2018-11-30 09:11:33 UTC

Resolved, Martin, what next?

Comment 14 Martin Peres 2018-11-30 13:13:51 UTC

(In reply to Jani Saarinen from comment #13)
> Resolved, Martin, what next?

Need to check if this reduces the amount of noise and make sure our timeouts are set correctly. I guess we should output some metrics to make sure of that :)

Comment 15 Martin Peres 2018-11-30 13:32:40 UTC

*** Bug 108220 has been marked as a duplicate of this bug. ***

Comment 16 Francesco Balestrieri 2018-12-28 09:30:54 UTC

Is the fix confirmed?

Comment 17 Lakshmi 2019-02-19 08:31:54 UTC

Last seen drmtip_22 (10 months, 1 week / 5510 runs ago).
This bug has been archived. Closing this bug as fixed.
