Bug 71564

Summary: [IVB bisected] igt/gem_storedw_batches_loop/normal causes [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... blitter ring idle
Product: DRI
Component: DRM/Intel
Status: CLOSED FIXED
Severity: major
Priority: high
Version: unspecified
Hardware: All
OS: Linux (All)
Reporter: lu hua <huax.lu>
Assignee: Chris Wilson <chris>
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
CC: intel-gfx-bugs
Whiteboard:
i915 platform:
i915 features:
Attachments: dmesg (flags: none)

Description lu hua 2013-11-13 08:17:44 UTC
Created attachment 89125: dmesg

System Environment:
--------------------------
Platform: Ivybridge
Kernel: (drm-intel-nightly) 8e88bd3a304ff70d23c0586be7531e24a56f3931

Bug detailed description:
-------------------------
Running ./gem_storedw_batches_loop --run-subtest normal causes the following kernel error:
<3>[   28.713648] [drm:i915_hangcheck_elapsed] *ERROR* Hangcheck timer elapsed... blitter ring idle
It happens on Ivybridge with the -nightly kernel and the -queued kernel.
It works well on the -fixes kernel and the debug kernel.

The latest known good commit: da66146425c3136943452988afd3d64cd551da58
The latest known bad commit: a94b013b91de055572183c6772865123fa955027

Output:
running storedw loop with stall every 1 batch
completed 524288 writes successfully
running storedw loop with stall every 2 batch
completed 524288 writes successfully
running storedw loop with stall every 3 batch
completed 524288 writes successfully
running storedw loop with stall every 5 batch
completed 524288 writes successfully
Subtest normal: SUCCESS

Reproduce steps:
-------------------------
1. ./gem_storedw_batches_loop --run-subtest normal
Comment 1 Daniel Vetter 2013-11-13 10:25:05 UTC
Can you please bisect where this regression has been introduced? Also please attach the error state.
Comment 2 Ben Widawsky 2013-11-13 21:46:39 UTC
Please retest with latest igt.
Comment 3 lu hua 2013-11-15 08:15:07 UTC
It still happens with the latest igt.
No error state was collected in debug/dri/0/i915_error_state.
Comment 4 Daniel Vetter 2013-11-16 12:21:22 UTC
Still waiting for the bisect.
Comment 5 Daniel Vetter 2013-11-16 12:23:08 UTC
Also the output you've pasted indicates that the test worked, and now that we don't capture an error state any more I'm confused where the problem is. Please clarify.
Comment 6 lu hua 2013-11-18 06:28:05 UTC
Bisect shows that 094f9a54e35500739da185cdb78f2e92fc379458 is the first bad commit.
commit 094f9a54e35500739da185cdb78f2e92fc379458
Author:     Chris Wilson <chris@chris-wilson.co.uk>
AuthorDate: Wed Sep 25 17:34:55 2013 +0100
Commit:     Daniel Vetter <daniel.vetter@ffwll.ch>
CommitDate: Thu Oct 3 20:01:30 2013 +0200

    drm/i915: Fix __wait_seqno to use true infinite timeouts

    When we switched to always using a timeout in conjunction with
    wait_seqno, we lost the ability to detect missed interrupts. Since, we
    have had issues with interrupts on a number of generations, and they are
    required to be delivered in a timely fashion for a smooth UX, it is
    important that we do log errors found in the wild and prevent the
    display stalling for upwards of 1s every time the seqno interrupt is
    missed.

    Rather than continue to fix up the timeouts to work around the interface
    impedance in wait_event_*(), open code the combination of
    wait_event[_interruptible][_timeout], and use the exposed timer to
    poll for seqno should we detect a lost interrupt.

    v2: In order to satisfy the debug requirement of logging missed
    interrupts with the real world requirements of making machines work even
    if interrupts are hosed, we revert to polling after detecting a missed
    interrupt.

    v3: Throw in a debugfs interface to simulate broken hw not reporting
    interrupts.
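
The behaviour described in the commit message above (sleep on the interrupt, and fall back to polling the seqno once a lost interrupt is detected) can be modelled roughly as follows. This is a minimal userspace sketch, not the i915 code: `FakeRing`, `wait_seqno`, the tick-based timer and the return strings are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRing:
    """Hypothetical stand-in for a GPU ring; not the i915 data structure."""
    seqno: int = 0            # last sequence number the simulated hardware completed
    irq_works: bool = True    # False simulates hardware that never raises interrupts
    missed_irq: bool = False  # sticky flag, set once a lost interrupt is detected
    log: list = field(default_factory=list)

    def tick(self):
        """Advance the hardware by one timer period; return True if an interrupt fired."""
        self.seqno += 1
        return self.irq_works


def wait_seqno(ring, target, max_ticks=1000):
    """Wait until ring.seqno reaches target.

    Returns "interrupt" if the wait was completed by an interrupt,
    "polling" if it was completed by checking the seqno on a timer tick,
    or "timeout" if max_ticks elapsed first.
    """
    for _ in range(max_ticks):
        irq = ring.tick()
        done = ring.seqno >= target
        if ring.missed_irq:
            # Interrupts are already known to be unreliable: poll on every tick.
            if done:
                return "polling"
            continue
        if done and irq:
            return "interrupt"
        if done:
            # The work finished but nothing woke us up: record the lost
            # interrupt once and switch this ring to polling from now on.
            ring.missed_irq = True
            ring.log.append("missed interrupt while waiting for seqno %d" % target)
            return "polling"
    return "timeout"
```

With `FakeRing(irq_works=False)`, the first wait logs one missed interrupt and later waits complete by polling without logging again, which mirrors the v2 note about keeping machines working even if interrupts are broken.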
Comment 7 Daniel Vetter 2013-11-18 06:48:11 UTC
I'm still confused how the test actually fails, please explain.

If there's a gpu hang also please attach the error state.
Comment 8 Chris Wilson 2013-11-18 10:49:45 UTC
Your bisect is the messenger, not the cause.
Comment 9 Chris Wilson 2013-11-18 10:51:13 UTC
(In reply to comment #7)
> I'm still confused how the test actually fails, please explain.
> 
> If there's a gpu hang also please attach the error state.

It's a missed interrupt, not exactly a GPU hang. We don't dump the error state as hangcheck finds the GPU idle rather than stuck.
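
The distinction drawn in this comment can be sketched as a small decision function. This is a simplified model under stated assumptions, not the driver's hangcheck: the function name `classify_ring` and its parameters are hypothetical, and the real hangcheck is more elaborate than a single comparison per check.

```python
def classify_ring(completed, submitted, completed_last_check, has_waiters):
    """Classify a ring at hangcheck time (illustrative model only).

    completed            -- seqno the hardware has finished
    submitted            -- last seqno that was queued to the ring
    completed_last_check -- value of `completed` at the previous hangcheck
    has_waiters          -- True if some process is still sleeping on the ring
    """
    if completed != completed_last_check:
        # The seqno advanced since the last check: the ring is making progress.
        return "active"
    if completed == submitted:
        # All queued work is done, so the ring is idle. If someone is still
        # asleep waiting for it, the wakeup interrupt must have been lost.
        return "missed interrupt" if has_waiters else "idle"
    # Work is outstanding and nothing has completed since the last check.
    return "stuck"
```

In this model only the "stuck" case would correspond to a GPU hang with an error state to dump; the "missed interrupt" case matches the "ring idle" message in this report, where the GPU has finished its work and there is nothing to capture.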
Comment 10 lu hua 2013-11-20 07:48:27 UTC
It works well on the latest -nightly kernel. Closing.
Comment 11 lu hua 2013-11-20 07:48:57 UTC
Verified. Fixed.
Comment 12 Elizabeth 2017-10-06 14:42:08 UTC
Closing old verified.
