Bug 100942 - [BAT][ELK] The gpu failed to reset when executing igt@gem_exec_fence@await-hang-default in CI
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: highest critical
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: PatchMerged
Keywords:
Duplicates: 100943 100999
Depends on:
Blocks:
 
Reported: 2017-05-05 07:21 UTC by Martin Peres
Modified: 2017-07-27 16:50 UTC

See Also:
i915 platform: G45
i915 features: GPU hang


Description Martin Peres 2017-05-05 07:21:00 UTC
The machine fi-elk-e7500 failed to reset the GPU when executing igt@gem_exec_fence@await-hang-default on CI_DRM_2587.


Out	
IGT-Version: 1.18-g862fa60 (x86_64) (Linux: 4.11.0-CI-CI_DRM_2587+ x86_64)
Stack trace:
  #0 [__igt_fail_assert+0x101]
  #1 [test_fence_await+0x38b]
  #2 [<unknown>+0x38b]
  #3 [<unknown>+0x38b]
Subtest await-hang-default: FAIL (8.891s)
Test requirement not met in function __real_main508, file gem_exec_fence.c:525:
Test requirement: gem_has_ring(i915, e->exec_id | e->flags)
Test requirement not met in function __real_main508, file gem_exec_fence.c:525:
Test requirement: gem_has_ring(i915, e->exec_id | e->flags)
Test requirement not met in function __real_main508, file gem_exec_fence.c:525:
Test requirement: gem_has_ring(i915, e->exec_id | e->flags)
Test requirement not met in function __real_main508, file gem_exec_fence.c:525:
Test requirement: gem_has_ring(i915, e->exec_id | e->flags)

Err	
(gem_exec_fence:1593) CRITICAL: Test assertion failure function test_fence_await, file gem_exec_fence.c:316:
(gem_exec_fence:1593) CRITICAL: Failed assertion: out[i] == i
(gem_exec_fence:1593) CRITICAL: error: 0 != 0x2
Subtest await-hang-default failed.
**** DEBUG ****
(gem_exec_fence:1593) CRITICAL: Test assertion failure function test_fence_await, file gem_exec_fence.c:316:
(gem_exec_fence:1593) CRITICAL: Failed assertion: out[i] == i
(gem_exec_fence:1593) CRITICAL: error: 0 != 0x2
****  END  ****

Dmesg	
[   93.747176] drm/i915: Resetting chip after gpu hang
[   94.250000] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -110


Full logs: https://intel-gfx-ci.01.org/CI/CI_DRM_2587/fi-elk-e7500/igt@gem_exec_fence@await-hang-default.html
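The -110 in the dmesg line above is a negated errno value; on Linux, 110 is ETIMEDOUT, which is consistent with the reset not completing in time. A minimal sketch of pulling that code out of a log excerpt, assuming the message format shown above (the helper name find_failed_resets is hypothetical and not part of IGT or the CI tooling):

```python
import errno
import re

# Matches the i915 error line format seen in the dmesg excerpt above.
RESET_FAIL_RE = re.compile(r"\*ERROR\* Failed to reset chip: (-?\d+)")


def find_failed_resets(dmesg_text):
    """Return (code, symbolic errno name) for each failed-reset line."""
    results = []
    for match in RESET_FAIL_RE.finditer(dmesg_text):
        code = int(match.group(1))
        # Kernel functions return negated errno values, so flip the sign
        # before looking up the symbolic name on this platform.
        name = errno.errorcode.get(-code, "unknown")
        results.append((code, name))
    return results


sample = (
    "[   93.747176] drm/i915: Resetting chip after gpu hang\n"
    "[   94.250000] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -110\n"
)
print(find_failed_resets(sample))
```

The symbolic name comes from the errno table of the machine running the script, so it only matches the kernel's meaning when run on Linux.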
Comment 1 Ricardo 2017-05-09 17:10:41 UTC
Adding tag into "Whiteboard" field - ReadyForDev
The bug is still active
*Status is correct
*Platform is included
*Feature is included
*Priority and Severity correctly set
Comment 3 Mika Kuoppala 2017-05-16 12:24:22 UTC
elk reset is not working as expected. Sometimes resetting the gpu results
in the whole gpu getting stuck (~5% chance)

No workaround found yet. Neither retrying the reset nor going d3 -> d0 helps to revive it.
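To put the ~5% figure in context, here is a rough sketch of how often a run of resets would be expected to hit the stuck state. It assumes each reset fails independently with the same probability, which is an assumption for illustration and not something measured in this bug; the function names are hypothetical.

```python
import math


def chance_of_stuck_gpu(p_fail, resets):
    """Probability that at least one of `resets` independent resets fails."""
    return 1.0 - (1.0 - p_fail) ** resets


def resets_needed(p_fail, confidence):
    """Smallest number of resets that hits a failure with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


p = 0.05  # the rough ~5% estimate from this comment, not a measured value
print(round(chance_of_stuck_gpu(p, 20), 2))
print(resets_needed(p, 0.95))
```

Under these assumptions, a single hang-and-reset has a small chance of failing, but several dozen of them in a row would be needed to reproduce the problem reliably, which would fit a failure that shows up only intermittently in CI.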
Comment 4 Ville Syrjala 2017-05-16 12:34:52 UTC
Does it fail when doing the media reset or the render reset?

There are a couple of interesting looking w/a names in the database:
WaMediaResetBeforeFullReset and WaMediaResetMainRingCleanup

WaMediaResetBeforeFullReset might just mean that we should change the order in which we do the resets.

Not sure what WaMediaResetMainRingCleanup might be about. Does it mean that we should do a media reset when cleaning up the ring, or that we should do some
kind of ring cleanup when we do a media reset?
Comment 5 Mika Kuoppala 2017-05-16 14:07:03 UTC
Failed even with media reset before render.

Next I will try stopping the rings in reverse order prior to
the reset. That is the best I can imagine WaMediaResetMainRingCleanup to be.
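The ordering being tried here can be written down as a small sketch. Everything in it is hypothetical: the ring names, the stop_ring/media_reset/render_reset step labels and the plan_reset helper are illustrative stand-ins, not i915 functions, and the sketch only records the order of steps rather than touching hardware.

```python
def plan_reset(rings):
    """Return the ordered list of steps for one reset attempt.

    `rings` is given in init order. Rings are stopped in reverse order,
    then the media reset is issued before the render reset.
    """
    steps = [("stop_ring", name) for name in reversed(rings)]
    steps.append(("media_reset", None))
    steps.append(("render_reset", None))
    return steps


# Hypothetical ring names, for illustration only.
for step in plan_reset(["render", "bsd"]):
    print(step)
```

This reflects only the experiment described in comments 4 and 5; it does not claim to match what the eventual fix does.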
Comment 6 Mika Kuoppala 2017-05-16 14:10:57 UTC
(In reply to Ville Syrjala from comment #4)
> Does it fail when doing the media reset or the render reset?

It fails when doing the render reset. The reset never completes and
the gpu seems to be stuck, as the ring inits fail when trying to restart.
Comment 7 Mika Kuoppala 2017-05-19 13:10:34 UTC
commit 2c80353f3cd0cd4b28b17d55226e5914d2c0d5e1
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date:   Fri May 19 12:13:40 2017 +0300

    drm/i915/g4x: Improve gpu reset reliability
Comment 8 Mika Kuoppala 2017-05-19 13:27:42 UTC
*** Bug 100943 has been marked as a duplicate of this bug. ***
Comment 9 Mika Kuoppala 2017-05-19 13:36:56 UTC
*** Bug 100999 has been marked as a duplicate of this bug. ***

