Summary: [CI][SHARDS] igt@gem_eio@unwedge-stress - fail - Failed assertion: med < limit && max < 5 * limit
Product: DRI
Component: DRM/Intel
Status: RESOLVED MOVED
Severity: normal
Priority: high
Version: XOrg git
Hardware: Other
OS: All
Whiteboard: ReadyForDev
i915 platform: SNB
i915 features: GEM/Other
Reporter: Martin Peres <martin.peres>
Assignee: Chris Wilson <chris>
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
CC: intel-gfx-bugs
Description
Martin Peres
2019-02-18 10:02:29 UTC
The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* SNB: igt@gem_eio@unwedge-stress - fail - Failed assertion: med < limit && max < 5 * limit
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5614/shard-snb7/igt@gem_eio@unwedge-stress.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5615/shard-snb7/igt@gem_eio@unwedge-stress.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5622/shard-snb1/igt@gem_eio@unwedge-stress.html

It exceeded 3s in some runs. Gah.

https://patchwork.freedesktop.org/patch/286706/ is my hope.

Fingers crossed once again,

commit 8f54b3c6c921275d10e33746553c40294ffa0d58
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Feb 19 12:21:57 2019 +0000

    drm/i915: Trim delays for wedging

    CI still reports the occasional multi-second delay for resets, in
    particular along the wedge+recovery paths. As the likely, and unbounded,
    delay here is from sync_rcu, use the expedited variant instead.

    Testcase: igt/gem_eio/unwedge-stress
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190219122215.8941-7-chris@chris-wilson.co.uk

A CI Bug Log filter associated to this bug has been updated:

{- SNB: igt@gem_eio@unwedge-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4838/shard-snb5/igt@gem_eio@reset-stress.html

Now that's just cruel, having supplied a patch specifically for the unwedge-stress subtest, you cross-pollute it with reset-stress!

Not that it'll make much difference, but there is quite a difference in driver paths between the two subtests.
(In reply to Chris Wilson from comment #5)
> Now that's just cruel, having supplied a patch specifically for the
> unwedge-stress subtest, you cross-pollute it with reset-stress!
>
> Not that it'll make much difference, but there is quite a difference in
> driver paths between the two subtests.

Sorry about that! However, unwedge-stress is still failing:

- https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4855/shard-snb5/igt@gem_eio@unwedge-stress.html
- https://intel-gfx-ci.01.org/tree/drm-tip/IGT_4858/shard-snb4/igt@gem_eio@unwedge-stress.html
- https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5671/shard-snb4/igt@gem_eio@unwedge-stress.html
- https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5672/shard-snb2/igt@gem_eio@unwedge-stress.html

If the fix for these issues does not also fix the reset-stress issues, we'll create a new bug!

We're just at the mercy of an unbounded wait. We're using sync_rcu_expedited everywhere we can here and still we get delayed. I'm tempted to drop the failure on the max timeout being several seconds, so long as the median is reasonable (all the limits are arbitrary anyway).
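The check under discussion can be modelled in a few lines. The following is only a sketch in Python, not the IGT code: the function name is borrowed from the gem_eio.c excerpt quoted later in this report, the 250e6 default is the `limit` value shown there (assumed here to be nanoseconds), and the `check_max` switch is a hypothetical parameter illustrating the idea of judging the run on the median alone. Python's `statistics.median` may also differ in detail from `igt_stats_get_median`.

```python
from statistics import median


def check_wait_elapsed(samples_ns, limit_ns=250e6, check_max=True):
    """Model of the assertion `med < limit && max < 5 * limit`.

    samples_ns: elapsed times for each reset cycle (assumed nanoseconds).
    check_max:  hypothetical switch; False keeps only the median check.
    Returns (ok, med, worst).
    """
    med = median(samples_ns)
    worst = max(samples_ns)
    ok = med < limit_ns and (not check_max or worst < 5 * limit_ns)
    return ok, med, worst


if __name__ == "__main__":
    # Nine quick cycles and one 3 s outlier, as in "It exceeded 3s in some runs".
    samples = [100e6] * 9 + [3e9]
    print(check_wait_elapsed(samples))                   # max check trips
    print(check_wait_elapsed(samples, check_max=False))  # median-only check
```

With a single multi-second outlier the median stays low, so only the `max < 5 * limit` term decides the result. That is why relaxing that term would hide these failures.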
A CI Bug Log filter associated to this bug has been updated:

{- SNB: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5947/shard-glk1/igt@gem_eio@unwedge-stress.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5949/shard-glk8/igt@gem_eio@unwedge-stress.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5950/shard-glk8/igt@gem_eio@unwedge-stress.html
* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5951/shard-glk5/igt@gem_eio@unwedge-stress.html

(In reply to CI Bug Log from comment #8)
> A CI Bug Log filter associated to this bug has been updated:
>
> {- SNB: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
> {+ SNB GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
>
> New failures caught by the filter:
>
> * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5947/shard-glk1/igt@gem_eio@unwedge-stress.html
> * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5949/shard-glk8/igt@gem_eio@unwedge-stress.html
> * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5950/shard-glk8/igt@gem_eio@unwedge-stress.html
> * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5951/shard-glk5/igt@gem_eio@unwedge-stress.html

Also seen on GLK.
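To see which subtests the widened filter covers, the test-name part can be tried as a regular expression. This is a sketch only: it assumes CI Bug Log treats the pattern as a regex that must match the whole test name, which this thread does not confirm, and the `matches` helper is hypothetical.

```python
import re

# Test-name pattern copied from the updated filter above.
FILTER = re.compile(r"igt@gem_eio@(reset|unwedge)-stress")


def matches(test_name):
    """Return True if the full test name matches the filter pattern."""
    return FILTER.fullmatch(test_name) is not None


if __name__ == "__main__":
    for name in ("igt@gem_eio@unwedge-stress",
                 "igt@gem_eio@reset-stress",
                 "igt@gem_eio@kms"):
        print(name, matches(name))
```

Under that assumption both stress subtests land in the same bug, even though, as noted earlier in the thread, they exercise rather different driver paths.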
A CI Bug Log filter associated to this bug has been updated:

{- SNB GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_258/fi-byt-n2820/igt@gem_eio@unwedge-stress.html

It looks like it was the reset worker feeding in the restart request that dragged us down.

commit 79ffac8599c4d8aa84d313920d3d86d7361c252b
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Wed Apr 24 21:07:17 2019 +0100

    drm/i915: Invert the GEM wakeref hierarchy

A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6043/shard-skl5/igt@gem_eio@reset-stress.html

A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_293/fi-icl-u3/igt@gem_eio@unwedge-stress.html

(In reply to CI Bug Log from comment #13)
> A CI Bug Log filter associated to this bug has been updated:
>
> {- SNB BYT SKL GLK: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
> {+ SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}
>
> New failures caught by the filter:
>
> * https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_293/fi-icl-u3/igt@gem_eio@unwedge-stress.html

Reopened this bug as this failure happened on ICL.

A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

No new failures caught with the new filter

Created attachment 144631 [details]
attachment-13473-0.html
Dear sender,
I am on leave during ww26.2. Please call my cell phone if it is urgent; sorry for any inconvenience this may cause.
For reference,

commit f0e39642f6f8da5406627bfa79c6600df949e203 (upstream/master, origin/master, origin/HEAD)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Tue Jul 2 12:40:45 2019 +0100

    i915/gem_eio: Assert the hanging request is correctly identified

    When forcing a reset, it is crucial that the kernel correctly identifies
    the injected hang. Verify this is the case for reset-stress.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

One hypothesis is that we are not resetting the guilty request and so hitting a hangcheck instead.

A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL APL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5088/shard-apl8/igt@gem_eio@reset-stress.html

<7> [944.138584] [IGT] Forcing GPU reset
<7> [944.138848] [drm:i915_reset_device [i915]] resetting chip
<5> [944.138957] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<7> [944.139197] [IGT] Checking that the GPU recovered
<5> [944.162438] Setting dangerous option reset - tainting kernel
<7> [944.275166] [drm:i915_reset_device [i915]] resetting chip
<5> [944.276899] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<5> [944.277178] Setting dangerous option reset - tainting kernel
<7> [944.277284] [IGT] Forcing GPU reset
<7> [944.277557] [drm:i915_reset_device [i915]] resetting chip
<5> [944.278273] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<7> [944.278579] [IGT] Checking that the GPU recovered
<5> [944.302432] Setting dangerous option reset - tainting kernel
<7> [946.381889] [drm:i915_reset_device [i915]] resetting chip
<5> [946.382011] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<5> [946.382270] Setting dangerous option reset - tainting kernel
<7> [946.382345] [IGT] Forcing GPU reset
<7> [946.382557] [drm:i915_reset_device [i915]] resetting chip
<5> [946.383318] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
<7> [946.383621] [IGT] Checking that the GPU recovered
<6> [946.475026] [IGT] gem_eio: exiting, ret=98

Which confirms that normally we expect quick reset+recovery cycles (with a reset period of 100ms between iterations). It also tells us that the delay is before i915_reset_device (although we could do with drm.debug=7 to be sure), which is the preamble in i915_handle_error(). Of note the only thing there is synchronize_rcu_expedited(). :|

A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL APL GLK ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL APL GLK CML ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6470/re-cml-u/igt@gem_eio@reset-stress.html

A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL APL GLK CML ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL BXT APL GLK CML ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_338/fi-bxt-j4205/igt@gem_eio@reset-stress.html

A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL BXT APL GLK CML ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL BXT APL GLK CFL CML ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_362/fi-cfl-8109u/igt@gem_eio@reset-stress.html
* https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_364/fi-cfl-guc/igt@gem_eio@reset-stress.html

A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL BXT APL GLK CFL CML ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL BXT APL GLK CFL CML ICL: igt@gem_eio@(reset|unwedge)-stress|igt@gem_eio@kms - fail - Failed assertion: med < limit && max < 5 * limit +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5193/shard-snb6/igt@gem_eio@kms.html

The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* SNB: igt@runner@aborted - fail -Previous test: gem_eio (kms)
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_3440/shard-snb5/igt@runner@aborted.html
  - https://intel-gfx-ci.01.org/tree/drm-tip/IGT_5193/shard-snb6/igt@runner@aborted.html

A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL BXT APL GLK CFL CML ICL: igt@gem_eio@(reset|unwedge)-stress|igt@gem_eio@kms - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL BXT APL GLK CFL CML ICL: igt@gem_eio@(reset|unwedge)-stress|igt@gem_eio@kms - fail - Failed assertion: med < limit && max < 5 * limit +}

No new failures caught with the new filter

A CI Bug Log filter associated to this bug has been updated:

{- SNB BYT SKL BXT APL GLK CFL CML ICL: igt@gem_eio@(reset|unwedge)-stress|igt@gem_eio@kms - fail - Failed assertion: med < limit && max < 5 * limit -}
{+ SNB BYT SKL BXT APL GLK CFL CML ICL: igt@gem_eio@(reset|unwedge)-stress - fail - Failed assertion: med < limit && max < 5 * limit +}

No new failures caught with the new filter

The CI Bug Log issue associated to this bug has been updated.
### Removed filters

* SNB: igt@runner@aborted - fail -Previous test: gem_eio (kms) (added on 2 hours ago)

Bug assessment: for over a month, the reset-stress and unwedge-stress gem_eio subtests have been passing on all platforms (including ICL), except SNB. Will watch for some more time and reduce the severity of the bug if the failures are not seen on other platforms. Also perhaps the SNB failures can be fixed by increasing the time to complete the wedge somewhat for SNB.

Submitted the following patch (not tested) as a candidate fix for the SNB issue:

i915/gem_eio: Attempt to fix reset-stress/unwedge-stress failures on SNB

gem_eio reset-stress and unwedge-stress subtests are now passing on all
platforms except SNB. Attempt to fix failures in SNB by giving a little
more time to complete the wedge.

Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=109661
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>

diff --git a/tests/i915/gem_eio.c b/tests/i915/gem_eio.c
index 892f3657c..20f66e00d 100644
--- a/tests/i915/gem_eio.c
+++ b/tests/i915/gem_eio.c
@@ -300,7 +300,7 @@ static void check_wait_elapsed(const char *prefix, int fd, igt_stats_t *st)
         * modeset back on) around resets, so may take a lot longer.
         */
        limit = 250e6;
-       if (intel_gen(intel_get_drm_devid(fd)) < 5)
+       if (intel_gen(intel_get_drm_devid(fd)) <= 5)
                limit += 300e6; /* guestimate for 2x worstcase modeset */

        med = igt_stats_get_median(st);

Updating platform field accordingly.

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/drm/intel/issues/232.
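For anyone revisiting the kernel log quoted earlier in this report, the stall can be located mechanically by looking for the largest gap between consecutive timestamps. The sketch below is a hypothetical helper, not part of IGT or CI Bug Log. It assumes each log line has the `<level> [seconds.micros] message` shape seen in that excerpt.

```python
import re

# Matches lines such as "<7> [944.278579] [IGT] Checking that the GPU recovered".
LINE = re.compile(r"<(\d)>\s*\[\s*(\d+\.\d+)\]\s*(.*)")


def largest_gap(lines):
    """Return (gap_seconds, message_before, message_after) for the
    largest jump between consecutive timestamped lines."""
    prev = None
    best = (0.0, None, None)
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # skip anything that is not a timestamped log line
        stamp, msg = float(m.group(2)), m.group(3)
        if prev is not None and stamp - prev[0] > best[0]:
            best = (stamp - prev[0], prev[1], msg)
        prev = (stamp, msg)
    return best


if __name__ == "__main__":
    excerpt = [
        "<7> [944.278579] [IGT] Checking that the GPU recovered",
        "<5> [944.302432] Setting dangerous option reset - tainting kernel",
        "<7> [946.381889] [drm:i915_reset_device [i915]] resetting chip",
        "<5> [946.382011] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff",
    ]
    gap, before, after = largest_gap(excerpt)
    print(f"{gap:.3f}s between {before!r} and {after!r}")
```

On the four lines copied from that log, the largest gap is roughly two seconds and sits immediately before the `i915_reset_device` message, consistent with the reading given in the thread that the delay occurs before `i915_reset_device`.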