System Environment:
--------------------------
Arch: i386
Platform: Sandybridge
Kernel: (drm-intel-next-queued)d7697eea3eec74c561d12887d892c53ac4380c00

Bug detailed description:
-------------------------
It happens on Sandybridge with the drm-intel-next-queued kernel. It works well on the drm-intel-fixes kernel.

Bisect shows: 05407ff889ceebe383aa5907219f86582ef96b72 is the first bad commit.

commit 05407ff889ceebe383aa5907219f86582ef96b72
Author:     Mika Kuoppala <mika.kuoppala@linux.intel.com>
AuthorDate: Thu May 30 09:04:29 2013 +0300
Commit:     Daniel Vetter <daniel.vetter@ffwll.ch>
CommitDate: Mon Jun 3 10:58:21 2013 +0200

    drm/i915: detect hang using per ring hangcheck_score

    Keep track of ring seqno progress and if there are no progress detected, declare hang. Use actual head (acthd) to distinguish between ring stuck and batchbuffer looping situation. Stuck ring will be kicked to trigger progress.

    This commit adds a hard limit for batchbuffer completion time. If batchbuffer completion time is more than 4.5 seconds, the gpu will be declared hung.

    Review comment from Ben which nicely clarifies the semantic change:

    "Maybe I'm just stating the functional changes of the patch, but in case they were unintended here is what I see as potential issues:

    1. "If ring B is waiting on ring A via semaphore, and ring A is making progress, albeit slowly - the hangcheck will fire. The check will determine that A is moving, however ring B will appear hung because the ACTHD doesn't move. I honestly can't say if that's actually a realistic problem to hit it probably implies the timeout value is too low.

    2. "There's also another corner case on the kick. If the seqno = 2 (though not stuck), and on the 3rd hangcheck, the ring is stuck, and we try to kick it... we don't actually try to find out if the kick helped"

    v2: use atchd to detect stuck ring from loop (Ben Widawsky)

    v3: Use acthd to check when ring needs kicking. Declare hang on third time in order to give time for kick_ring to take effect.
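The scoring scheme described in the commit message can be sketched as a small per-ring state machine. This is a minimal sketch, not the i915 code: `struct ring_state`, `hangcheck_tick()`, `hang_limit_ms()` and the tick results are hypothetical names, and the 1500 ms period is an assumption chosen only so that three consecutive ticks without seqno progress add up to the 4.5 second limit quoted above. Each tick clears the score when the ring's seqno advances (or the ring is idle), kicks the ring when ACTHD has not moved either, and declares a hang on the third tick without progress.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed (hypothetical) hangcheck timer period in milliseconds. */
#define HANGCHECK_PERIOD_MS 1500
/* Hypothetical: stalled ticks before a hang is declared (3 x 1.5 s = 4.5 s). */
#define HANG_TICKS 3

/* Hypothetical per-ring bookkeeping for this sketch. */
struct ring_state {
    uint32_t last_seqno; /* completed seqno sampled at the previous tick */
    uint32_t last_acthd; /* ACTHD sampled at the previous tick */
    int score;           /* consecutive ticks without seqno progress */
    int kicks;           /* number of times the ring was kicked */
};

enum tick_result { TICK_OK, TICK_BUSY, TICK_KICKED, TICK_HUNG };

/* Longest time without seqno progress before a hang is declared. */
static int hang_limit_ms(void)
{
    return HANGCHECK_PERIOD_MS * HANG_TICKS;
}

/* Evaluate one ring at one hangcheck tick.
 * seqno:    last completed seqno read from the ring now
 * acthd:    current active head
 * has_work: whether requests are still outstanding on the ring */
static enum tick_result hangcheck_tick(struct ring_state *r, uint32_t seqno,
                                       uint32_t acthd, bool has_work)
{
    enum tick_result res;

    assert(r != NULL);
    if (seqno != r->last_seqno || !has_work) {
        /* Requests are retiring, or nothing is queued: no hang. */
        r->score = 0;
        res = TICK_OK;
    } else {
        r->score++;
        if (r->score >= HANG_TICKS) {
            /* Third stalled tick: the caller would reset the GPU. */
            res = TICK_HUNG;
        } else if (acthd == r->last_acthd) {
            /* Head is not moving either: treat the ring as stuck and kick
             * it, leaving one more tick for the kick to take effect. */
            r->kicks++;
            res = TICK_KICKED;
        } else {
            /* Head moves but no seqno completes: batch looping or slow. */
            res = TICK_BUSY;
        }
    }
    r->last_seqno = seqno;
    r->last_acthd = acthd;
    return res;
}
```

Fed an unchanged seqno and ACTHD on every tick, the model kicks the ring on the second stalled tick and declares it hung on the third. It also reproduces the first issue from Ben's review: a ring blocked on a semaphore shows neither seqno nor ACTHD progress, so it is kicked and then declared hung even while the ring it waits on is still working.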
    v4: Update commit msg

output:
filling ring
waiting
done waiting, check dmesg

dmesg:
[60161.225096] [drm:i915_driver_open],
[60161.225116] [drm:intel_crtc_set_config], [CRTC:3] [FB:27] #connectors=1 (x y) (0 0)
[60161.225120] [drm:intel_modeset_stage_output_state], [CONNECTOR:7:VGA-1] to [CRTC:3]
[60161.225122] [drm:intel_crtc_set_config], [CRTC:5] [NOFB]
[60161.225124] [drm:intel_modeset_stage_output_state], [CONNECTOR:7:VGA-1] to [CRTC:3]
[60161.225129] [drm:i915_driver_open],
[60171.239225] [drm:intel_crtc_set_config], [CRTC:3] [FB:27] #connectors=1 (x y) (0 0)
[60171.239231] [drm:intel_modeset_stage_output_state], [CONNECTOR:7:VGA-1] to [CRTC:3]
[60171.239233] [drm:intel_crtc_set_config], [CRTC:5] [NOFB]
[60171.239235] [drm:intel_modeset_stage_output_state], [CONNECTOR:7:VGA-1] to [CRTC:3]
[60171.239254] [drm:i915_driver_open],
[60171.239259] [drm:intel_crtc_set_config], [CRTC:3] [FB:27] #connectors=1 (x y) (0 0)
[60171.239261] [drm:intel_modeset_stage_output_state], [CONNECTOR:7:VGA-1] to [CRTC:3]
[60171.239262] [drm:intel_crtc_set_config], [CRTC:5] [NOFB]
[60171.239263] [drm:intel_modeset_stage_output_state], [CONNECTOR:7:VGA-1] to [CRTC:3]
[60171.239267] [drm:i915_driver_open],
[60176.787935] [drm:i915_hangcheck_elapsed] *ERROR* bsd ring: stuck on addr 0x3b4
[60176.788236] [drm:i915_error_work_func], resetting chip
[60176.790751] [drm:gm45_get_vblank_counter], trying to get vblank count for disabled pipe B
[60176.790883] [drm:ironlake_update_plane], Writing base 00072000 00000000 0 0 5120
[60176.790940] [drm:intel_crtc_set_config], [CRTC:3] [FB:27] #connectors=1 (x y) (0 0)
[60176.790944] [drm:intel_modeset_stage_output_state], [CONNECTOR:7:VGA-1] to [CRTC:3]
[60176.790946] [drm:intel_crtc_set_config], [CRTC:5] [NOFB]
[60176.790947] [drm:intel_modeset_stage_output_state], [CONNECTOR:7:VGA-1] to [CRTC:3]
[60176.801235] [drm:gmbus_xfer], GMBUS [i915 gmbus dpc] NAK for addr: 0050 r(1)
[60176.801239] [drm:drm_do_probe_ddc_edid], drm: skipping non-existent adapter i915 gmbus dpc
[60176.801259] [drm:intel_ironlake_crt_detect_hotplug], ironlake hotplug adpa=0x83f40018, result 1
[60176.801261] [drm:intel_crt_detect], CRT detected via hotplug
[60176.801909] [drm:gmbus_xfer], GMBUS [i915 gmbus dpc] NAK for addr: 0050 r(1)
[60176.801912] [drm:drm_do_probe_ddc_edid], drm: skipping non-existent adapter i915 gmbus dpc
[60176.801923] [drm:intel_ironlake_crt_detect_hotplug], ironlake hotplug adpa=0x83f40018, result 1
[60176.801924] [drm:intel_crt_detect], CRT detected via hotplug
[60176.802678] [drm:gmbus_xfer], GMBUS [i915 gmbus dpc] NAK for addr: 0050 r(1)
[60176.802680] [drm:drm_do_probe_ddc_edid], drm: skipping non-existent adapter i915 gmbus dpc
[60176.802691] [drm:intel_ironlake_crt_detect_hotplug], ironlake hotplug adpa=0x83f40018, result 1
[60176.802692] [drm:intel_crt_detect], CRT detected via hotplug

Reproduce steps:
----------------
1. ./gem_hangcheck_forcewake
2. dmesg -r | egrep "<[1-6]>" | grep drm
Where's the error-state?
Created attachment 80326: i915_error_state
This is a genuine hangcheck failure. The bsd ring is waiting upon the blt ring, which is chock-full of busy work. Mika, we should check whether the stuck ring is on a semaphore wait and discount it if its target ring is still busy and has not yet passed the awaited seqno.
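The check proposed above can be sketched as a predicate evaluated before a stalled ring's hangcheck score is raised. This is a minimal sketch under stated assumptions: `struct ring` and `stall_is_semaphore_wait()` are hypothetical names for this illustration, not the driver's API; only the wrap-safe seqno comparison mirrors the driver's `i915_seqno_passed()` helper. A stalled ring is excused only while it is blocked on a semaphore whose target ring is still busy and has not yet reached the awaited seqno, so a real hang is still caught by the target ring's own hangcheck.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Wrap-safe "seq1 is at or after seq2", in the style of i915_seqno_passed(). */
static bool seqno_passed(uint32_t seq1, uint32_t seq2)
{
    return (int32_t)(seq1 - seq2) >= 0;
}

/* Hypothetical model of a ring for this sketch. */
struct ring {
    uint32_t completed_seqno;        /* last seqno this ring has retired */
    bool busy;                       /* requests still outstanding */
    const struct ring *wait_target;  /* semaphore target ring, or NULL */
    uint32_t wait_seqno;             /* seqno awaited on wait_target */
};

/* True if a stalled ring should NOT accumulate hangcheck score because it
 * is legitimately waiting on another ring that is still making its way to
 * the awaited seqno. */
static bool stall_is_semaphore_wait(const struct ring *r)
{
    const struct ring *target;

    assert(r != NULL);
    target = r->wait_target;
    if (target == NULL)
        return false; /* not on a semaphore: count the stall */
    if (seqno_passed(target->completed_seqno, r->wait_seqno))
        return false; /* target already signalled: we should have woken */
    return target->busy; /* excuse the stall only while the target works */
}
```

In the scenario from the error state, bsd is the waiting ring and blt its busy target, so bsd's stall would not be counted while blt keeps working. The sketch does not handle two rings waiting on each other: both would be excused forever, so a real implementation would also need some form of semaphore deadlock detection.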
*** Bug 65397 has been marked as a duplicate of this bug. ***
Created attachment 80331: Reset hangcheck score after kicking
Created attachment 80333: Don't count semaphore waits towards a hang
Created attachment 80335: Don't count semaphore waits towards a hang
Tested with the patches "Reset hangcheck score after kicking" and "Don't count semaphore waits towards a hang". They fix the problem.
igt/kms_flip/delayed-flip-vs-panning and igt/kms_flip/delayed-wf_vblank-vs-modeset also cause a GPU hang on Haswell and bisect to the same commit.
(In reply to comment #9)
> igt/kms_flip/delayed-flip-vs-panning and
> igt/kms_flip/delayed-wf_vblank-vs-modeset also cause GPU hung on Haswell and
> have same bisect commit.

The problem also exists on the linux-3.9.y kernel (3.9.5, 5dd2e9869de2d28fc7e5c274ff9c12af4361ba86).
Should be fixed now:

commit 6274f2126a0454d3c3df1bc9cc6f5e18302696f7
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Mon Jun 10 11:20:21 2013 +0100

    drm/i915: Don't count semaphore waits towards a stuck ring
Verified. Fixed.
Closing old verified.
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.