The expected execution time is not taken into account when creating shard lists, so the estimated total runtime of a shard can exceed the maximum execution time allowed for it. When that happens, the shard execution hits its timeout (see [1, 2]).
Let's discuss here how to handle this case as it is potentially killing any sharded run, reducing our coverage!
There might be something behind this. Runtime estimation data was collected from PASSED test runtimes of CI_DRM_4000 - 4066:
IGT_4432/shards$ shuf-check.py *
File x0000 estimated 707.57 unknown 0
File x0001 estimated 494.89 unknown 0
File x0002 estimated 669.97 unknown 0
File x0003 estimated 593.20 unknown 0
File x0004 estimated 590.12 unknown 0
File x0005 estimated 731.17 unknown 0
File x0006 estimated 392.13 unknown 0
File x0007 estimated 688.72 unknown 0
File x0008 estimated 356.21 unknown 0
File x0009 estimated 471.19 unknown 0
File x0010 estimated 492.80 unknown 0
File x0011 estimated 286.99 unknown 0
File x0012 estimated 568.27 unknown 0
File x0013 estimated 721.03 unknown 0
File x0014 estimated 686.85 unknown 0
File x0015 estimated 622.08 unknown 0
File x0016 estimated 641.90 unknown 0
File x0017 estimated 493.45 unknown 0
File x0018 estimated 641.31 unknown 0
File x0019 estimated 812.69 unknown 0
File x0020 estimated 742.41 unknown 0
File x0021 estimated 635.08 unknown 0
File x0022 estimated 565.93 unknown 0
File x0023 estimated 883.95 unknown 0
File x0024 estimated 586.08 unknown 0
File x0025 estimated 740.21 unknown 0
File x0026 estimated 530.69 unknown 0
File x0027 estimated 416.87 unknown 0
File x0028 estimated 371.33 unknown 0
File x0029 estimated 1318.44 unknown 0
File x0030 estimated 666.33 unknown 0
File x0031 estimated 428.05 unknown 0
File x0032 estimated 583.25 unknown 0
File x0033 estimated 434.65 unknown 0
File x0034 estimated 420.10 unknown 0
Shortest x0011 286.99
Longest x0029 1318.44
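For reference, a minimal sketch of the kind of check shown above: sum the known per-test runtimes for each shard, count the tests with no data, and flag shards whose estimate exceeds a budget. This is not the actual shuf-check.py; the function names, the dictionary layout and the budget parameter are all assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def estimate_shard(tests: Iterable[str],
                   runtimes: Dict[str, float]) -> Tuple[float, int]:
    """Return (estimated runtime, number of tests without runtime data).

    `runtimes` maps a test name to its previously measured runtime
    (hypothetical layout, same time unit as the collected data).
    """
    total = 0.0
    unknown = 0
    for name in tests:
        if name in runtimes:
            total += runtimes[name]
        else:
            # No PASSED run was recorded for this test, so it cannot be estimated.
            unknown += 1
    return total, unknown


def shards_over_budget(shards: Dict[str, Iterable[str]],
                       runtimes: Dict[str, float],
                       budget: float) -> List[str]:
    """Return the names of shards whose estimate exceeds `budget`, sorted."""
    return sorted(
        name for name, tests in shards.items()
        if estimate_shard(tests, runtimes)[0] > budget
    )
```

With the numbers above, any budget below 1318.44 would flag x0029, while the shortest shard (x0011, 286.99) leaves a lot of time unused.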
A new method of splitting the tests between shards is now in use. The runtimes of individual shards still vary widely (probably not easy to balance with randomization), but as a whole they stay very stable from shuffle to shuffle. The first runs will be IGTPW_1281 and (not yet existing) IGT_4442.
Shard shuffling improved to take estimated test durations into account.
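One way a duration-aware shuffle could work, as a sketch only: shuffle the test list, then hand each test to the shard with the smallest estimated total so far. This is not necessarily what the CI scripts do; `split_into_shards`, `default_runtime` and the greedy least-loaded strategy are assumptions for illustration.

```python
import heapq
import random
from typing import Dict, List, Sequence


def split_into_shards(tests: Sequence[str],
                      runtimes: Dict[str, float],
                      shard_count: int,
                      seed: int = 0,
                      default_runtime: float = 1.0) -> List[List[str]]:
    """Randomly order tests, then greedily balance estimated runtime.

    Tests without runtime data are charged `default_runtime`
    (a hypothetical placeholder value).
    """
    rng = random.Random(seed)
    shuffled = list(tests)
    rng.shuffle(shuffled)  # keep the test-to-shard mapping random

    # Min-heap of (estimated load, shard index): the least loaded shard pops first.
    heap = [(0.0, index) for index in range(shard_count)]
    heapq.heapify(heap)
    shards: List[List[str]] = [[] for _ in range(shard_count)]

    for name in shuffled:
        load, index = heapq.heappop(heap)
        shards[index].append(name)
        cost = runtimes.get(name, default_runtime)
        heapq.heappush(heap, (load + cost, index))
    return shards
```

With this greedy scheme the gap between the longest and shortest shard estimate never exceeds the runtime of the single longest test, so a few very long tests would still dominate the variance.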
I'm not sure we can close this. The problem can still occur if all the tests become slower in a single revision.
I think the proper fix will be the new runner, when it learns to stop running new tests after a certain time threshold. I'll re-assign the bug to Petri for his input.
Petri, could you add a feature that stops running new tests if we are too close to the deadline? This should mitigate this issue.
Tomi, how comfortable would you be with bumping the shard timeout if you had the guarantee that tests are still being executed?
Author: Petri Latvala <firstname.lastname@example.org>
Date: Wed Oct 10 13:41:00 2018 +0300
runner: Add --overall-timeout
(In reply to Martin Peres from comment #5)
> Tomi, how comfortable would you be with bumping the shard timeout if you had
> the guarantee that tests are still being executed?
Re-opening since the feature is still not in use.
What kind of time limits are we talking about? Timeout affects all runs.
I'd be more comfortable to exit igt early if we're close to the timeout. This would just get a note that not everything was tested due to limit, and no false positives from killing the job.
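The early-exit behaviour described here could look roughly like the sketch below: before starting each test, check the elapsed time against the overall limit and, once too close to it, stop and report what was not run. This is not the igt_runner implementation of --overall-timeout; the function name, the `margin` parameter and the injectable `clock` are assumptions for illustration.

```python
import time
from typing import Callable, List, Sequence, Tuple


def run_with_overall_timeout(tests: Sequence[str],
                             run_test: Callable[[str], None],
                             overall_timeout: float,
                             margin: float = 0.0,
                             clock: Callable[[], float] = time.monotonic
                             ) -> Tuple[List[str], List[str]]:
    """Run tests in order, but start no new test close to the deadline.

    Returns (executed, not_run). A test that is already running is never
    interrupted here; only the start of new tests is prevented.
    """
    start = clock()
    executed: List[str] = []
    for position, name in enumerate(tests):
        elapsed = clock() - start
        if elapsed + margin >= overall_timeout:
            # Too close to the limit: report the remainder instead of
            # letting an external timeout kill the whole job.
            return executed, list(tests[position:])
        run_test(name)
        executed.append(name)
    return executed, []
```

The `not_run` list is what would feed the "not everything was tested due to limit" note, so that a truncated shard is not mistaken for a failure.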
(In reply to Tomi Sarvela from comment #9)
> What kind of time limits are we talking about? Timeout affects all runs.
> I'd be more comfortable to exit igt early if we're close to the timeout.
> This would just get a note that not everything was tested due to limit, and
> no false positives from killing the job.
That's indeed the idea! We would need to trust the IGT runner (and maybe write support for an external watchdog), but that would definitely limit the amount of random noise, which makes incomplete reports hard to act upon automatically.
(In reply to Martin Peres from comment #8)
> Re-opening since the feature is still not in use.
Moving to CI Infra.
Added --overall-timeout to Farm 1.
Resolved. Martin, what next?
(In reply to Jani Saarinen from comment #13)
> Resolved. Martin, what next?
Need to check if this reduces the amount of noise and make sure our timeouts are set correctly. I guess we should output some metrics to make sure of that :)
*** Bug 108220 has been marked as a duplicate of this bug. ***