Bug 65394 - [SNB Bisected]igt/gem_hangcheck_forcewake cause [drm:i915_hangcheck_elapsed] *ERROR* bsd ring: stuck on addr 0x3b4
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: All Linux (All)
Importance: high major
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Duplicates: 65397
Depends on:
Blocks:
 
Reported: 2013-06-05 07:40 UTC by lu hua
Modified: 2017-10-06 14:46 UTC

See Also:
i915 platform:
i915 features:


Attachments
i915_error_state (2.12 MB, text/plain)
2013-06-05 08:38 UTC, lu hua
Reset hangcheck score after kicking (1.16 KB, patch)
2013-06-05 09:32 UTC, Chris Wilson
Don't count semaphore waits towards a hang (3.33 KB, patch)
2013-06-05 09:32 UTC, Chris Wilson
Don't count semaphore waits towards a hang (4.47 KB, patch)
2013-06-05 09:46 UTC, Chris Wilson

Description lu hua 2013-06-05 07:40:36 UTC
System Environment:
--------------------------
Arch:           i386
Platform:       Sandybridge
Kernel:     (drm-intel-next-queued)d7697eea3eec74c561d12887d892c53ac4380c00

Bug detailed description:
-------------------------
It happens on Sandybridge with the drm-intel-next-queued kernel. It works well on the drm-intel-fixes kernel.
Bisect shows: 05407ff889ceebe383aa5907219f86582ef96b72 is the first bad commit.
commit 05407ff889ceebe383aa5907219f86582ef96b72
Author:     Mika Kuoppala <mika.kuoppala@linux.intel.com>
AuthorDate: Thu May 30 09:04:29 2013 +0300
Commit:     Daniel Vetter <daniel.vetter@ffwll.ch>
CommitDate: Mon Jun 3 10:58:21 2013 +0200

    drm/i915: detect hang using per ring hangcheck_score

    Keep track of ring seqno progress and if there are no
    progress detected, declare hang. Use actual head (acthd)
    to distinguish between ring stuck and batchbuffer looping
    situation. Stuck ring will be kicked to trigger progress.

    This commit adds a hard limit for batchbuffer completion time.
    If batchbuffer completion time is more than 4.5 seconds,
    the gpu will be declared hung.

    Review comment from Ben which nicely clarifies the semantic change:

    "Maybe I'm just stating the functional changes of the patch, but in case
    they were unintended here is what I see as potential issues:

    1. "If ring B is waiting on ring A via semaphore, and ring A is making
       progress, albeit slowly - the hangcheck will fire. The check will
       determine that A is moving, however ring B will appear hung because
       the ACTHD doesn't move. I honestly can't say if that's actually a
       realistic problem to hit it probably implies the timeout value is too
       low.

    2. "There's also another corner case on the kick. If the seqno = 2
       (though not stuck), and on the 3rd hangcheck, the ring is stuck, and
       we try to kick it... we don't actually try to find out if the kick
       helped"

    v2: use atchd to detect stuck ring from loop (Ben Widawsky)

    v3: Use acthd to check when ring needs kicking.
    Declare hang on third time in order to give time for
    kick_ring to take effect.

    v4: Update commit msg
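
As a rough illustration of the scoring scheme the commit message above describes, here is a toy model in Python. It is not the kernel code: the names `RingCheck`, `hangcheck`, `HANG_THRESHOLD` and `CHECK_PERIOD_S` are made up for this sketch, and the 1.5 second check period is an assumption derived from the 4.5 second limit and "declare hang on third time".

```python
from dataclasses import dataclass

HANG_THRESHOLD = 3    # "declare hang on third time", per the commit message
CHECK_PERIOD_S = 1.5  # assumption: 4.5 s limit spread over 3 checks


@dataclass
class RingCheck:
    """Hypothetical per-ring hangcheck bookkeeping."""
    name: str
    last_seqno: int = -1
    last_acthd: int = -1
    score: int = 0


def hangcheck(ring: RingCheck, seqno: int, acthd: int) -> str:
    """Run one periodic check; return 'ok', 'kick', 'busy' or 'hung'."""
    if seqno != ring.last_seqno:
        # The seqno advanced since the last check: progress, so reset.
        ring.last_seqno, ring.last_acthd, ring.score = seqno, acthd, 0
        return "ok"

    # No seqno progress. An unchanged ACTHD means the ring itself is
    # stuck; a moving ACTHD means a batchbuffer is still executing.
    stuck = acthd == ring.last_acthd
    ring.last_acthd = acthd
    ring.score += 1

    if ring.score >= HANG_THRESHOLD:
        return "hung"
    return "kick" if stuck else "busy"


ring = RingCheck("bsd")
for _ in range(4):
    print(hangcheck(ring, seqno=5, acthd=0x3B4))
```

In this model a ring whose seqno never advances is declared hung on the third check without progress, whether or not ACTHD moves. A ring that sits on a semaphore wait therefore looks the same as a genuinely stuck ring, which matches point 1 of the review comment quoted in the commit message.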

output:
filling ring
waiting
done waiting, check dmesg

dmesg:
[60161.225096] [drm:i915_driver_open],
[60161.225116] [drm:intel_crtc_set_config], [CRTC:3] [FB:27] #connectors=1 (x y) (0 0)
[60161.225120] [drm:intel_modeset_stage_output_state], [CONNECTOR:7:VGA-1] to [CRTC:3]
[60161.225122] [drm:intel_crtc_set_config], [CRTC:5] [NOFB]
[60161.225124] [drm:intel_modeset_stage_output_state], [CONNECTOR:7:VGA-1] to [CRTC:3]
[60161.225129] [drm:i915_driver_open],
[60171.239225] [drm:intel_crtc_set_config], [CRTC:3] [FB:27] #connectors=1 (x y) (0 0)
[60171.239231] [drm:intel_modeset_stage_output_state], [CONNECTOR:7:VGA-1] to [CRTC:3]
[60171.239233] [drm:intel_crtc_set_config], [CRTC:5] [NOFB]
[60171.239235] [drm:intel_modeset_stage_output_state], [CONNECTOR:7:VGA-1] to [CRTC:3]
[60171.239254] [drm:i915_driver_open],
[60171.239259] [drm:intel_crtc_set_config], [CRTC:3] [FB:27] #connectors=1 (x y) (0 0)
[60171.239261] [drm:intel_modeset_stage_output_state], [CONNECTOR:7:VGA-1] to [CRTC:3]
[60171.239262] [drm:intel_crtc_set_config], [CRTC:5] [NOFB]
[60171.239263] [drm:intel_modeset_stage_output_state], [CONNECTOR:7:VGA-1] to [CRTC:3]
[60171.239267] [drm:i915_driver_open],
[60176.787935] [drm:i915_hangcheck_elapsed] *ERROR* bsd ring: stuck on addr 0x3b4
[60176.788236] [drm:i915_error_work_func], resetting chip
[60176.790751] [drm:gm45_get_vblank_counter], trying to get vblank count for disabled pipe B
[60176.790883] [drm:ironlake_update_plane], Writing base 00072000 00000000 0 0 5120
[60176.790940] [drm:intel_crtc_set_config], [CRTC:3] [FB:27] #connectors=1 (x y) (0 0)
[60176.790944] [drm:intel_modeset_stage_output_state], [CONNECTOR:7:VGA-1] to [CRTC:3]
[60176.790946] [drm:intel_crtc_set_config], [CRTC:5] [NOFB]
[60176.790947] [drm:intel_modeset_stage_output_state], [CONNECTOR:7:VGA-1] to [CRTC:3]
[60176.801235] [drm:gmbus_xfer], GMBUS [i915 gmbus dpc] NAK for addr: 0050 r(1)
[60176.801239] [drm:drm_do_probe_ddc_edid], drm: skipping non-existent adapter i915 gmbus dpc
[60176.801259] [drm:intel_ironlake_crt_detect_hotplug], ironlake hotplug adpa=0x83f40018, result 1
[60176.801261] [drm:intel_crt_detect], CRT detected via hotplug
[60176.801909] [drm:gmbus_xfer], GMBUS [i915 gmbus dpc] NAK for addr: 0050 r(1)
[60176.801912] [drm:drm_do_probe_ddc_edid], drm: skipping non-existent adapter i915 gmbus dpc
[60176.801923] [drm:intel_ironlake_crt_detect_hotplug], ironlake hotplug adpa=0x83f40018, result 1
[60176.801924] [drm:intel_crt_detect], CRT detected via hotplug
[60176.802678] [drm:gmbus_xfer], GMBUS [i915 gmbus dpc] NAK for addr: 0050 r(1)
[60176.802680] [drm:drm_do_probe_ddc_edid], drm: skipping non-existent adapter i915 gmbus dpc
[60176.802691] [drm:intel_ironlake_crt_detect_hotplug], ironlake hotplug adpa=0x83f40018, result 1
[60176.802692] [drm:intel_crt_detect], CRT detected via hotplug

Reproduce steps:
----------------
1. ./gem_hangcheck_forcewake
2. dmesg -r | egrep "<[1-6]>" |grep drm
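
For anyone scripting the check in step 2, a small Python filter can pull the hangcheck errors out of a saved log. This is only a sketch: the pattern is derived from the single error line in the dmesg excerpt above, and `find_hangs` is a hypothetical helper name, so other kernels may format the message differently.

```python
import re

# Pattern modelled on the one error line seen in this report:
# [60176.787935] [drm:i915_hangcheck_elapsed] *ERROR* bsd ring: stuck on addr 0x3b4
HANG_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+"
    r"\[drm:i915_hangcheck_elapsed\]\s+\*ERROR\*\s+"
    r"(?P<ring>\S+) ring: stuck on addr (?P<addr>0x[0-9a-fA-F]+)"
)


def find_hangs(log_text: str):
    """Return (timestamp, ring name, address) for each hangcheck error."""
    return [
        (float(m["ts"]), m["ring"], int(m["addr"], 16))
        for m in HANG_RE.finditer(log_text)
    ]


sample = (
    "[60171.239267] [drm:i915_driver_open],\n"
    "[60176.787935] [drm:i915_hangcheck_elapsed] *ERROR* "
    "bsd ring: stuck on addr 0x3b4\n"
)
for ts, ring, addr in find_hangs(sample):
    print(f"{ts:.6f} {ring} ring stuck at {addr:#x}")
```

The pattern is not anchored to the start of the line, so it also matches raw `dmesg -r` output, where each line carries a `<level>` prefix.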
Comment 1 Chris Wilson 2013-06-05 08:28:46 UTC
Where's the error-state?
Comment 2 lu hua 2013-06-05 08:38:24 UTC
Created attachment 80326
i915_error_state
Comment 3 Chris Wilson 2013-06-05 08:56:06 UTC
This is a genuine hangcheck failure. The bsd ring is waiting upon the blt ring which is chock full of busy work.

Mika, we should check to see if we are on a semaphore wait and discount that stuck ring if its target ring is still busy and not yet past the seqno.
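
The check suggested here could look roughly like the following toy model in Python. It is a sketch of the idea only, not the patch attached to this bug: `RingState`, `legitimately_waiting` and `account_stalled_check` are hypothetical names, and seqno wraparound is ignored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RingState:
    """Hypothetical snapshot of one ring at hangcheck time."""
    name: str
    seqno: int                                  # last seqno the ring completed
    busy: bool = False                          # ring still has work queued
    wait_target: Optional["RingState"] = None   # ring signalling our semaphore
    wait_seqno: int = 0                         # seqno we wait for on the target
    score: int = 0                              # accumulated hangcheck score


def legitimately_waiting(ring: RingState) -> bool:
    """True if the ring waits on a semaphore whose target ring is still
    busy and has not yet passed the awaited seqno."""
    target = ring.wait_target
    return target is not None and target.busy and target.seqno < ring.wait_seqno


def account_stalled_check(ring: RingState) -> int:
    """Call when a check saw no progress on `ring`; return its new score."""
    if legitimately_waiting(ring):
        return ring.score   # discount the wait, do not count it as a hang
    ring.score += 1
    return ring.score


blt = RingState("blt", seqno=10, busy=True)
bsd = RingState("bsd", seqno=3, wait_target=blt, wait_seqno=12)
print(account_stalled_check(bsd))   # blt is still working towards seqno 12
blt.seqno = 12
print(account_stalled_check(bsd))   # blt passed the seqno, the stall counts
```

While the blt ring is busy and behind the awaited seqno, the bsd ring's score stays unchanged. Once the target has passed that seqno, a bsd ring that still makes no progress starts accumulating score again.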
Comment 4 Daniel Vetter 2013-06-05 09:04:31 UTC
*** Bug 65397 has been marked as a duplicate of this bug. ***
Comment 5 Chris Wilson 2013-06-05 09:32:17 UTC
Created attachment 80331
Reset hangcheck score after kicking
Comment 6 Chris Wilson 2013-06-05 09:32:45 UTC
Created attachment 80333
Don't count semaphore waits towards a hang
Comment 7 Chris Wilson 2013-06-05 09:46:40 UTC
Created attachment 80335
Don't count semaphore waits towards a hang
Comment 8 lu hua 2013-06-06 05:55:49 UTC
Tested with the patches "Reset hangcheck score after kicking" and "Don't count semaphore waits towards a hang". The issue is fixed.
Comment 9 lu hua 2013-06-07 08:12:05 UTC
igt/kms_flip/delayed-flip-vs-panning and igt/kms_flip/delayed-wf_vblank-vs-modeset also cause a GPU hang on Haswell and bisect to the same commit.
Comment 10 lu hua 2013-06-09 03:12:19 UTC
(In reply to comment #9)
> igt/kms_flip/delayed-flip-vs-panning and
> igt/kms_flip/delayed-wf_vblank-vs-modeset also cause GPU hung on Haswell and
> have same bisect commit.

It also exists on the linux-3.9.y kernel at 5dd2e9869de2d28fc7e5c274ff9c12af4361ba86 (3.9.5).
Comment 11 Chris Wilson 2013-06-12 09:32:38 UTC
Should be fixed now:

commit 6274f2126a0454d3c3df1bc9cc6f5e18302696f7
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jun 10 11:20:21 2013 +0100

    drm/i915: Don't count semaphore waits towards a stuck ring
Comment 12 lu hua 2013-06-13 05:29:06 UTC
Verified. Fixed.
Comment 13 Elizabeth 2017-10-06 14:46:09 UTC
Closing old verified.

