65397 – [IVB Bisected]igt/gem_wait_render_timeout causes [drm:i915_hangcheck_elapsed] *ERROR* bsd ring: stuck on addr 0x144

Status:	CLOSED DUPLICATE of bug 65394

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	unspecified
Hardware:	All Linux (All)

Importance:	high major
Assignee:	Mika Kuoppala
QA Contact:	Intel GFX Bugs mailing list

Reported:	2013-06-05 08:02 UTC by lu hua
Modified:	2017-10-06 14:46 UTC
CC List:	1 user

Description lu hua 2013-06-05 08:02:48 UTC

System Environment:
--------------------------
Arch:           x86_64
Platform:       Ivybridge
Kernel:         (drm-intel-next-queued) d7697eea3eec74c561d12887d892c53ac4380c00

Bug detailed description:
-------------------------
After running ./gem_wait_render_timeout, the following error appears in dmesg: [drm:i915_hangcheck_elapsed] *ERROR* bsd ring: stuck on addr 0x144
It happens on only one IVB machine (Genuine Intel(R) CPU @ 2.20GHz) with the drm-intel-next-queued kernel. It does not happen with the drm-intel-fixes kernel.

Bisect shows: 05407ff889ceebe383aa5907219f86582ef96b72 is the first bad commit.
commit 05407ff889ceebe383aa5907219f86582ef96b72
Author:     Mika Kuoppala <mika.kuoppala@linux.intel.com>
AuthorDate: Thu May 30 09:04:29 2013 +0300
Commit:     Daniel Vetter <daniel.vetter@ffwll.ch>
CommitDate: Mon Jun 3 10:58:21 2013 +0200

    drm/i915: detect hang using per ring hangcheck_score

    Keep track of ring seqno progress and if there are no
    progress detected, declare hang. Use actual head (acthd)
    to distinguish between ring stuck and batchbuffer looping
    situation. Stuck ring will be kicked to trigger progress.

    This commit adds a hard limit for batchbuffer completion time.
    If batchbuffer completion time is more than 4.5 seconds,
    the gpu will be declared hung.

    Review comment from Ben which nicely clarifies the semantic change:

    "Maybe I'm just stating the functional changes of the patch, but in case
    they were unintended here is what I see as potential issues:

    1. "If ring B is waiting on ring A via semaphore, and ring A is making
       progress, albeit slowly - the hangcheck will fire. The check will
       determine that A is moving, however ring B will appear hung because
       the ACTHD doesn't move. I honestly can't say if that's actually a
       realistic problem to hit it probably implies the timeout value is too
       low.

    2. "There's also another corner case on the kick. If the seqno = 2
       (though not stuck), and on the 3rd hangcheck, the ring is stuck, and
       we try to kick it... we don't actually try to find out if the kick
       helped"

    v2: use atchd to detect stuck ring from loop (Ben Widawsky)

    v3: Use acthd to check when ring needs kicking.
    Declare hang on third time in order to give time for
    kick_ring to take effect.

    v4: Update commit msg
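
The scoring scheme described in the commit message above can be sketched roughly as follows. This is an illustrative model only, not the i915 code: the struct, enum, and function names are hypothetical, and the 1.5 s sampling period is an assumption inferred from the message ("third time" and "4.5 seconds"). The score counts consecutive samples without seqno progress; ACTHD is compared against the previous sample to tell a stuck ring (kick it) from a batch that is still executing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed values: three checks at a 1.5 s period add up to the 4.5 s limit. */
#define HANGCHECK_PERIOD_MS  1500
#define HANGCHECK_HUNG_SCORE 3

/* Hypothetical verdicts for one hangcheck sample of one ring. */
enum ring_state { RING_PROGRESS, RING_BUSY, RING_KICKED, RING_HUNG };

/* Hypothetical per-ring bookkeeping. */
struct ring_hangcheck {
    uint32_t seqno; /* seqno seen at the previous sample */
    uint32_t acthd; /* active head seen at the previous sample */
    int score;      /* consecutive samples without seqno progress */
};

/* Feed one periodic sample for a ring and return the verdict. */
enum ring_state hangcheck_sample(struct ring_hangcheck *hc,
                                 uint32_t seqno, uint32_t acthd)
{
    enum ring_state state;

    assert(hc != NULL);

    if (seqno != hc->seqno) {
        /* The seqno advanced: the ring is making progress. */
        hc->score = 0;
        state = RING_PROGRESS;
    } else {
        /* No seqno progress: an unchanged ACTHD means the ring is stuck,
         * a moving ACTHD means a batch is still executing. */
        bool stuck = (acthd == hc->acthd);

        hc->score++;
        if (hc->score >= HANGCHECK_HUNG_SCORE)
            state = RING_HUNG;   /* hard limit reached, even if ACTHD moves */
        else
            state = stuck ? RING_KICKED : RING_BUSY;
    }

    hc->seqno = seqno;
    hc->acthd = acthd;
    return state;
}

/* Time without seqno progress under the assumed sampling period. */
int hangcheck_stalled_ms(const struct ring_hangcheck *hc)
{
    return hc->score * HANGCHECK_PERIOD_MS;
}
```

In this model, a ring whose seqno and ACTHD stay fixed (for example ACTHD 0x144, as in the dmesg output below) is kicked on the first two samples without progress and declared hung on the third. It also reproduces the first concern in the review comment: a ring waiting on a semaphore has a static ACTHD, so it looks stuck even when the ring it waits on is progressing.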


output:
32768 iters is enough work
Finished with 179580100 time remaining
signal handler called 1865 times

dmesg:
[  166.346777] [drm:i915_driver_open],
[  166.346792] [drm:intel_crtc_set_config], [CRTC:3] [NOFB]
[  166.346797] [drm:intel_crtc_set_config], [CRTC:5] [NOFB]
[  166.346799] [drm:intel_crtc_set_config], [CRTC:7] [NOFB]
[  166.346804] [drm:i915_driver_open],
[  181.188260] [drm:i915_driver_open],
[  181.188277] [drm:i915_driver_open],
[  181.189097] [drm:i915_driver_open],
[  181.189107] [drm:i915_driver_open],
[  189.518017] [drm:i915_hangcheck_elapsed] *ERROR* bsd ring: stuck on addr 0x144
[  189.518221] [drm:i915_error_work_func], resetting chip
[  189.518979] [drm:gm45_get_vblank_counter], trying to get vblank count for disabled pipe A
[  189.518983] [drm:gm45_get_vblank_counter], trying to get vblank count for disabled pipe B
[  189.518984] [drm:gm45_get_vblank_counter], trying to get vblank count for disabled pipe C
[  189.520119] [drm:intel_crtc_set_config], [CRTC:3] [NOFB]
[  189.520124] [drm:intel_crtc_set_config], [CRTC:5] [NOFB]
[  189.520126] [drm:intel_crtc_set_config], [CRTC:7] [NOFB]

lspci:
00:00.0 Host bridge: Intel Corporation Ivy Bridge DRAM Controller (rev 04)
00:02.0 VGA compatible controller: Intel Corporation Device 0162 (rev 04)
00:14.0 USB Controller: Intel Corporation Panther Point USB xHCI Host Controller (rev 02)
00:16.0 Communication controller: Intel Corporation Panther Point MEI Controller #1 (rev 02)
00:16.3 Serial controller: Intel Corporation Panther Point KT Controller (rev 02)
00:19.0 Ethernet controller: Intel Corporation 82579LM Gigabit Network Connection (rev 02)
00:1a.0 USB Controller: Intel Corporation Panther Point USB Enhanced Host Controller #2 (rev 02)
00:1b.0 Audio device: Intel Corporation Panther Point High Definition Audio Controller (rev 02)
00:1c.0 PCI bridge: Intel Corporation Panther Point PCI Express Root Port 1 (rev c2)
00:1c.6 PCI bridge: Intel Corporation Panther Point PCI Express Root Port 7 (rev c2)
00:1c.7 PCI bridge: Intel Corporation Panther Point PCI Express Root Port 8 (rev c2)
00:1d.0 USB Controller: Intel Corporation Panther Point USB Enhanced Host Controller #1 (rev 02)
00:1e.0 PCI bridge: Intel Corporation 82801 PCI Bridge (rev a2)
00:1f.0 ISA bridge: Intel Corporation Panther Point LPC Controller (rev 02)
00:1f.2 SATA controller: Intel Corporation Panther Point 6 port SATA AHCI Controller (rev 02)
00:1f.3 SMBus: Intel Corporation Panther Point SMBus Controller (rev 02)
02:00.0 Network controller: Intel Corporation Centrino Advanced-N 6205 (rev 34)
03:00.0 PCI bridge: Texas Instruments XIO2000(A)/XIO2200A PCI Express-to-PCI Bridge (rev 03)
04:00.0 FireWire (IEEE 1394): Texas Instruments XIO2200A IEEE-1394a-2000 Controller (PHY/Link) (rev 01)

Reproduce steps:
----------------
1. ./gem_wait_render_timeout
2. dmesg -r | egrep "<[1-6]>" |grep drm
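
For reference, the filter in step 2 can be expressed as a small predicate. dmesg -r keeps the raw "<N>" log-level prefix, so the pipeline keeps lines that carry a level from 1 to 6 (dropping level 7 debug output) and mention drm. The sketch below mirrors that matching in C; the function names are hypothetical, and like the grep pipeline it matches the tokens anywhere in the line.

```c
#include <stdbool.h>
#include <string.h>

/* True if the line contains a "<N>" token with N in 1..6,
 * mirroring egrep "<[1-6]>". */
bool has_level_1_to_6(const char *line)
{
    for (const char *p = strchr(line, '<'); p != NULL; p = strchr(p + 1, '<')) {
        if (p[1] >= '1' && p[1] <= '6' && p[2] == '>')
            return true;
    }
    return false;
}

/* True if the line would survive both filters of the pipeline:
 * a level 1..6 token and the substring "drm". */
bool is_drm_message(const char *line)
{
    return has_level_1_to_6(line) && strstr(line, "drm") != NULL;
}
```

With this predicate, an error-level (3) line carrying the hangcheck message is kept, while debug-level (7) drm lines and non-drm lines are dropped.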

Comment 1 Daniel Vetter 2013-06-05 08:24:22 UTC

Mika, can you please take a look?

Comment 2 Chris Wilson 2013-06-05 08:25:24 UTC

If it only occurs on pre-release silicon, should we even bother to diagnose this?

Comment 3 Daniel Vetter 2013-06-05 09:03:55 UTC

(In reply to comment #2)
> If it only occurs on pre-release silicon, should we even bother to diagnose
> this?

Bug report says it happens on ivb ... sounds more like a dupe of bug #65394 to me.

Comment 4 Daniel Vetter 2013-06-05 09:04:31 UTC


*** This bug has been marked as a duplicate of bug 65394 ***

Comment 5 lu hua 2013-07-17 06:51:30 UTC

Verified. Fixed.

Comment 6 Elizabeth 2017-10-06 14:46:04 UTC

Closing old verified bug.
