Bug 104007

Summary: [BAT] igt@gem_ringfill@basic-default-hang - dmesg-warn - *ERROR* Failed to reset chip: -110 - and the aftermath of a wedged GPU.
Product: DRI
Component: DRM/Intel
Status: CLOSED WORKSFORME
Severity: normal
Priority: medium
Version: DRI git
Hardware: Other
OS: All
Whiteboard: ReadyForDev
i915 platform: GM45
i915 features: GEM/Other
Reporter: Marta Löfstedt <marta.lofstedt>
Assignee: Marta Löfstedt <marta.lofstedt>
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
CC: intel-gfx-bugs

Description Marta Löfstedt 2017-12-01 06:45:55 UTC
Seen from CI_DRM_3422 onwards in every run so far, up to CI_DRM_3425:

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3422/fi-ctg-p8600/igt@gem_ringfill@basic-default-hang.html

dmesg-warn:
[  314.726587] i915 0000:00:02.0: Resetting chip after gpu hang
[  317.742384] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -110

then, due to:
IGT-Version: 1.20-gac6739bc (x86_64) (Linux: 4.15.0-rc1-CI-CI_DRM_3425+ x86_64)
Test requirement not met in function igt_require_gem, file ioctl_wrappers.c:1427:
Test requirement: err == 0
Unresponsive i915/GEM device
Last errno: 5, Input/output error
Subtest basic-all: SKIP

the following gem tests are skipped:
igt@gem_sync@basic-all
igt@gem_sync@basic-each
igt@gem_sync@basic-many-each
igt@gem_sync@basic-store-all
igt@gem_sync@basic-store-each
igt@gem_tiled_blits@basic
igt@gem_tiled_fence_blits@basic

some kms tests pass, but these are then skipped for the same reason as above:
igt@kms_busy@basic-flip-a
igt@kms_busy@basic-flip-b
igt@kms_cursor_legacy@basic-busy-flip-before-cursor-legacy

then fail on:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3422/fi-ctg-p8600/igt@kms_pipe_crc_basic@hang-read-crc-pipe-a.html
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3422/fi-ctg-p8600/igt@kms_pipe_crc_basic@hang-read-crc-pipe-b.html

(kms_pipe_crc_basic:3585) igt-gt-CRITICAL: Test assertion failure function igt_force_gpu_reset, file igt_gt.c:406:
(kms_pipe_crc_basic:3585) igt-gt-CRITICAL: Failed assertion: !wedged
(kms_pipe_crc_basic:3585) igt-gt-CRITICAL: Last errno: 9, Bad file descriptor
Subtest hang-read-crc-pipe-B failed.

then, on one run so far, there is a Softdog incomplete on:
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_3424/fi-ctg-p8600/igt@kms_pipe_crc_basic@nonblocking-crc-pipe-a.html
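
For triage, the dmesg lines quoted above can be picked apart mechanically. The following is only a rough sketch, not part of IGT or the CI tooling: the function name, the regular expression and the small errno table are illustrative assumptions. The table hard-codes the Linux values for the three error numbers that appear in this report (5, 9 and 110).

```python
import re

# Linux errno numbers seen in this report. Hard-coded on purpose so the
# decoding does not depend on the platform running the script.
LINUX_ERRNO = {5: "EIO", 9: "EBADF", 110: "ETIMEDOUT"}

# Matches lines such as:
# "[  317.742384] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -110"
RESET_FAILURE = re.compile(
    r"^\[\s*(?P<ts>\d+\.\d+)\]\s.*\*ERROR\* Failed to reset chip: -(?P<err>\d+)\s*$"
)


def parse_reset_failure(line):
    """Return (timestamp_seconds, errno_name) for a reset-failure line, else None."""
    m = RESET_FAILURE.match(line)
    if m is None:
        return None
    err = int(m.group("err"))
    return float(m.group("ts")), LINUX_ERRNO.get(err, f"errno {err}")


log = [
    "[  314.726587] i915 0000:00:02.0: Resetting chip after gpu hang",
    "[  317.742384] [drm:i915_reset [i915]] *ERROR* Failed to reset chip: -110",
]
hits = [hit for hit in map(parse_reset_failure, log) if hit is not None]
print(hits)
```

Read this way, the -110 in the dmesg-warn is ETIMEDOUT, while the later "Last errno: 5" and "Last errno: 9" in the IGT output are EIO and EBADF.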

Comment 1 Marta Löfstedt 2017-12-01 11:50:28 UTC
CI_DRM_3426 - CI_DRM_3428 are green, so maybe this isn't a regression after all.

Comment 2 Chris Wilson 2017-12-01 16:33:38 UTC
commit f7096d40eea84d32eb1e3b0f2b4407167aae9a83
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Dec 1 12:20:11 2017 +0000

    drm/i915: Sleep and retry a GPU reset if at first we don't succeed
    
    As we declare the GPU wedged if the reset fails, such a failure is quite
    terminal. Before taking that drastic action, let's sleep first and try
    active, in the hope that the hardware has quietened down and is then
    able to reset. After a few such attempts, it is fair to say that the HW
    is truly wedged.
    
    v2: Always print the failure message now, we precheck whether resets are
    disabled.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=104007
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171201122011.16841-1-chris@chris-wilson.co.uk
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>

Now let's just wait a month to see if it reoccurs.
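
The idea in the commit message above (sleep and retry a failed reset, and only declare the GPU wedged after a few attempts) can be illustrated in userspace. This is a hedged sketch of the general pattern, not the i915 code: the function name, the attempt count and the delay are assumptions chosen for illustration, and the reset callback is a stand-in for the real hardware reset.

```python
import time


def reset_with_retry(try_reset, attempts=3, delay_s=0.1, sleep=time.sleep):
    """Call try_reset() until it returns 0, sleeping between failed attempts.

    Returns (wedged, last_error). attempts and delay_s are illustrative
    values, not taken from the commit.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    err = 0
    for attempt in range(attempts):
        err = try_reset()
        if err == 0:
            return False, 0  # reset succeeded, device is usable again
        if attempt + 1 < attempts:
            sleep(delay_s)  # give the hardware time to quieten down
    return True, err  # every attempt failed: treat the device as wedged


# Hypothetical hardware that times out twice (-110) and then resets cleanly.
results = iter([-110, -110, 0])
naps = []
wedged, err = reset_with_retry(lambda: next(results), sleep=naps.append)
print(wedged, err, naps)
```

The sleep function is passed in as a parameter so that the retry logic can be exercised without actually waiting. With a callback that always returns -110, the function reports the device as wedged only after the last attempt.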

Comment 3 Marta Löfstedt 2017-12-04 08:46:19 UTC
(In reply to Chris Wilson from comment #2)
...
> 
> Now let's just wait a month to see if it reoccurs.

Well, that will be difficult since fi-ctg-p8600 hasn't been in the lab since CI_DRM_3432.

From IRC #intel-gfx
<marta_> tsa, what's up with fi-ctg-p8600, it hasn't been in the lab since CI_DRM_3432, ickle has a fix for https://bugs.freedesktop.org/show_bug.cgi?id=104007 that needs to be verified.
* tarceri (~tarceri@101.176.24.57) has joined
<tsa> marta_: it has the poweron problem, as always
<tsa> WOL and AC-boot are not reliable on that machine

Comment 4 Marta Löfstedt 2017-12-04 12:29:55 UTC
CTG is back up and has no issues on gem_ringfill@basic-default-hang
