Bug 100942

Summary: [BAT][ELK] The gpu failed to reset when executing igt@gem_exec_fence@await-hang-default in CI
Product: DRI
Reporter: Martin Peres <martin.peres>
Component: DRM/Intel
Assignee: Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical
Priority: highest
CC: intel-gfx-bugs, jani.saarinen
Version: XOrg git
Hardware: Other
OS: All
Whiteboard: PatchMerged
i915 platform: G45
i915 features: GPU hang

Description Martin Peres 2017-05-05 07:21:00 UTC
The machine fi-elk-e7500 failed to reset the GPU when executing igt@gem_exec_fence@await-hang-default on CI_DRM_2587.


Out	
IGT-Version: 1.18-g862fa60 (x86_64) (Linux: 4.11.0-CI-CI_DRM_2587+ x86_64)
Stack trace:
  #0 [__igt_fail_assert+0x101]
  #1 [test_fence_await+0x38b]
  #2 [<unknown>+0x38b]
  #3 [<unknown>+0x38b]
Subtest await-hang-default: FAIL (8.891s)
Test requirement not met in function __real_main508, file gem_exec_fence.c:525:
Test requirement: gem_has_ring(i915, e->exec_id | e->flags)
Test requirement not met in function __real_main508, file gem_exec_fence.c:525:
Test requirement: gem_has_ring(i915, e->exec_id | e->flags)
Test requirement not met in function __real_main508, file gem_exec_fence.c:525:
Test requirement: gem_has_ring(i915, e->exec_id | e->flags)
Test requirement not met in function __real_main508, file gem_exec_fence.c:525:
Test requirement: gem_has_ring(i915, e->exec_id | e->flags)

Err	
(gem_exec_fence:1593) CRITICAL: Test assertion failure function test_fence_await, file gem_exec_fence.c:316:
(gem_exec_fence:1593) CRITICAL: Failed assertion: out[i] == i
(gem_exec_fence:1593) CRITICAL: error: 0 != 0x2
Subtest await-hang-default failed.
**** DEBUG ****
(gem_exec_fence:1593) CRITICAL: Test assertion failure function test_fence_await, file gem_exec_fence.c:316:
(gem_exec_fence:1593) CRITICAL: Failed assertion: out[i] == i
(gem_exec_fence:1593) CRITICAL: error: 0 != 0x2
****  END  ****

Dmesg	
[   93.747176] drm/i915: Resetting chip after gpu hang
[   94.250000] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -110
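Note on the error code: -110 is the negation of ETIMEDOUT (errno 110 on Linux), which suggests the driver gave up waiting for the reset to complete; the two dmesg lines are roughly half a second apart. The following is only a minimal sketch of a poll-with-timeout pattern that yields such a return value. It is written in Python for illustration, the helper name wait_for and the "stuck" condition are hypothetical, and it is not the i915 implementation.

```python
import time

ETIMEDOUT = 110  # Linux errno value; dmesg reports its negation (-110)


def wait_for(condition, timeout_s, poll_s=0.001):
    """Poll condition() until it is true or timeout_s elapses.

    Returns 0 on success and -ETIMEDOUT on timeout, mirroring the
    kernel convention of negative errno return values.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if condition():
            return 0
        if time.monotonic() >= deadline:
            return -ETIMEDOUT
        time.sleep(poll_s)


# Hypothetical "reset done" status that never becomes true (stuck chip).
stuck = wait_for(lambda: False, timeout_s=0.01)
# Hypothetical status that is already true (reset completed).
ok = wait_for(lambda: True, timeout_s=0.01)
print(stuck, ok)
```

In this model a stuck chip always ends in the timeout branch, no matter how long the caller is willing to wait.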


Full logs: https://intel-gfx-ci.01.org/CI/CI_DRM_2587/fi-elk-e7500/igt@gem_exec_fence@await-hang-default.html
Comment 1 Ricardo 2017-05-09 17:10:41 UTC
Adding the ReadyForDev tag to the "Whiteboard" field.
The bug is still active:
*Status is correct
*Platform is included
*Feature is included
*Priority and Severity correctly set
Comment 3 Mika Kuoppala 2017-05-16 12:24:22 UTC
elk reset is not working as expected. Sometimes resetting the gpu results
in the whole gpu getting stuck (~5% chance).

No workaround found yet. Neither retrying the reset nor going d3 -> d0 helps to revive it.
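To put the ~5% figure in perspective, here is a small worked calculation of how often a CI run would be expected to hit the stuck state. It assumes each reset fails independently with the same probability, which the report does not establish; the function names are made up for this note.

```python
import math


def p_at_least_one_failure(p_fail, n_resets):
    """Probability of seeing one or more stuck resets in n_resets attempts."""
    return 1.0 - (1.0 - p_fail) ** n_resets


def resets_needed(p_fail, confidence):
    """Smallest number of resets that reproduces the hang with the given confidence."""
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_fail))


p_twenty = p_at_least_one_failure(0.05, 20)
n_for_95 = resets_needed(0.05, 0.95)
print(round(p_twenty, 3), n_for_95)
```

Under that independence assumption, 20 resets hit the hang about 64% of the time, and about 59 resets are needed for a 95% chance of reproducing it, so a fix would need a long soak before it can be called verified.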
Comment 4 Ville Syrjala 2017-05-16 12:34:52 UTC
Does it fail when doing the media reset or the render reset?

There are a couple of interesting looking w/a names in the database:
WaMediaResetBeforeFullReset and WaMediaResetMainRingCleanup

WaMediaResetBeforeFullReset might just mean that we should change the order in which we do the resets.

Not sure what WaMediaResetMainRingCleanup might be about. Does it mean that we should do a media reset when cleaning up the ring, or that we should do some
kind of ring cleanup when we do a media reset?
Comment 5 Mika Kuoppala 2017-05-16 14:07:03 UTC
Failed even with media reset before render.

I will next try stopping the rings in reverse order prior to
the reset. That is the best I can imagine WaMediaResetMainRingCleanup to be.
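The orderings discussed in comments 4 and 5 can be written down as a small step planner. This is a sketch of the ordering idea only: plan_reset, its flags, and the ring names are hypothetical, it touches no hardware, and it does not reflect the patch that was eventually merged.

```python
def plan_reset(rings, media_first=True, stop_rings_reversed=True):
    """Return the ordered list of steps for one reset attempt.

    rings               -- ring names in their normal init order (assumed)
    media_first         -- do the media reset before the render reset
    stop_rings_reversed -- stop rings in reverse init order before resetting
    """
    stop_order = list(reversed(rings)) if stop_rings_reversed else list(rings)
    steps = ["stop ring " + name for name in stop_order]

    domains = ["media", "render"] if media_first else ["render", "media"]
    steps += ["reset " + name for name in domains]
    return steps


# Ring names are placeholders for illustration.
steps = plan_reset(["render", "bsd"])
for step in steps:
    print(step)
```

Keeping the ordering in one place like this makes it easy to try the variants one at a time, for example plan_reset(rings, media_first=False).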
Comment 6 Mika Kuoppala 2017-05-16 14:10:57 UTC
(In reply to Ville Syrjala from comment #4)
> Does it fail when doing the media reset or the render reset?

It fails when doing the render reset. The reset never completes and
the gpu seems to be stuck, as the ring inits fail when trying to restart.
Comment 7 Mika Kuoppala 2017-05-19 13:10:34 UTC
commit 2c80353f3cd0cd4b28b17d55226e5914d2c0d5e1
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date:   Fri May 19 12:13:40 2017 +0300

    drm/i915/g4x: Improve gpu reset reliability
Comment 8 Mika Kuoppala 2017-05-19 13:27:42 UTC
*** Bug 100943 has been marked as a duplicate of this bug. ***
Comment 9 Mika Kuoppala 2017-05-19 13:36:56 UTC
*** Bug 100999 has been marked as a duplicate of this bug. ***
