Bug 74100

Summary: [SNB Bisected] igt/gem_reset_stats/close-pending-fork-render causes system hang
Product: DRI
Reporter: lu hua <huax.lu>
Component: DRM/Intel
Assignee: Mika Kuoppala <mika.kuoppala>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: critical
Priority: highest
CC: intel-gfx-bugs
Version: unspecified
Hardware: All
OS: Linux (All)
Whiteboard:
i915 platform:
i915 features:
Attachments:
Description                        Flags
dmesg                              none
new dmesg                          none
hang preventer hack                none
fix up semaphore hangcheck code    none
dmesg                              none

Description lu hua 2014-01-27 02:46:33 UTC
Created attachment 92828 [details]
dmesg

System Environment:
--------------------------
Platform: Sandybridge
Kernel: (drm-intel-nightly) f27f16540be56813df2ebb8e1106dd5c258f07c3

Bug detailed description:
-------------------------
The subtest hangs the whole system on Sandybridge with the -nightly, -queued and -fixes kernels.

Bisect shows: igt commit c05c88c2b641aaab83608fb2c8e816893690c1fe is the first bad commit.
commit c05c88c2b641aaab83608fb2c8e816893690c1fe
Author:     Mika Kuoppala <mika.kuoppala@linux.intel.com>
AuthorDate: Tue Jan 21 17:40:08 2014 +0200
Commit:     Daniel Vetter <daniel.vetter@ffwll.ch>
CommitDate: Wed Jan 22 09:45:27 2014 +0100

    tests/gem_reset_stats: stop only one ring when submitting hang

    If we stop all the rings, we can end up blaming the innocent
    rings on hangcheck.

    Reference: https://bugs.freedesktop.org/show_bug.cgi?id=73652
    Signed-off-by: Mika Kuoppala <mika.kuoppala@intel.com>
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

output:
IGT-Version: 1.5-gb5109e6 (x86_64) (Linux: 3.13.0-rc8_drm-intel-nightly_f27f16_20140126+ x86_64)
Subtest close-pending-fork-render: SUCCESS
Test requirement not met in function gem_require_ring, file ./../lib/drmtest.h:311:
Last errno: 11, Resource temporarily unavailable
Test requirement: (!(gem_has_vebox(fd)))
Subtest reset-stats-vebox: SKIP
Subtest reset-stats-ctx-vebox: SKIP
Subtest ban-vebox: SKIP
Subtest ban-ctx-vebox: SKIP
Subtest reset-count-vebox: SKIP
Subtest reset-count-ctx-vebox: SKIP
Subtest unrelated-ctx-vebox: SKIP
Subtest close-pending-vebox: SKIP
Subtest close-pending-ctx-vebox: SKIP
Subtest close-pending-fork-vebox: SKIP

Reproduce steps:
-------------------------
1. ./gem_reset_stats --run-subtest close-pending-fork-render
Comment 1 Chris Wilson 2014-01-31 16:59:44 UTC
Can you please repaste the error message after applying:

commit 48ad03ca0c5f078b8d12a64323fd93b3858041af
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Jan 31 16:56:01 2014 +0000

    lib: Capture errno on entry
    
    When printing the errno, it is important that we capture the user errno
    before we make any library calls - as they may alter the value.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=74007
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
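
The point of that commit message can be illustrated with a small standalone sketch. This is not the igt code: `report_failure` is a hypothetical helper, and the `errno = 0` line merely stands in for any library call that overwrites errno as a side effect. The only idea shown is that errno is copied into a local on entry, before anything else runs.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical reporting helper (not the igt implementation).
 * The caller's errno is copied into a local on entry, before any
 * library call has a chance to overwrite it.
 */
static int report_failure(const char *what, char *buf, size_t len)
{
	int saved_errno = errno;	/* capture first, use later */

	assert(buf != NULL && len > 0);

	/* Stand-in for a library call that clobbers errno as a side effect. */
	errno = 0;

	/* Format the message from the saved copy, not from the live errno. */
	snprintf(buf, len, "%s: Last errno: %d, %s",
		 what, saved_errno, strerror(saved_errno));
	return saved_errno;
}
```

Without the local copy, the message would be built from whatever value the intervening calls left behind, which may have nothing to do with the failure being reported.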
Comment 2 Mika Kuoppala 2014-02-05 14:34:29 UTC
Please retest with the latest igt and drm-intel-nightly.
Comment 3 lu hua 2014-02-07 08:29:45 UTC
Created attachment 93589 [details]
new dmesg

It still hangs the system with the latest igt and -nightly kernel.
output:
IGT-Version: 1.5-g0269d1d (x86_64) (Linux: 3.13.0_drm-intel-nightly_8f5284_20140207+ x86_64)
Subtest close-pending-fork-render: SUCCESS
Test requirement not met in function gem_require_ring, file ./../lib/drmtest.h:312:
Last errno: 11, Resource temporarily unavailable
Test requirement: (!(gem_has_vebox(fd)))
Subtest reset-stats-vebox: SKIP
Subtest reset-stats-ctx-vebox: SKIP
Subtest ban-vebox: SKIP
Subtest ban-ctx-vebox: SKIP
Subtest reset-count-vebox: SKIP
Subtest reset-count-ctx-vebox: SKIP
Subtest unrelated-ctx-vebox: SKIP
Subtest close-pending-vebox: SKIP
Subtest close-pending-ctx-vebox: SKIP
Subtest close-pending-fork-vebox: SKIP
Comment 4 Daniel Vetter 2014-02-11 11:40:32 UTC
Ok, I've cleaned up the superfluous SKIP behaviour and some other stuff in the testcase, and I can reproduce the issue here. Looking at netconsole, the system seems to go down with a failed gpu reset:

[   63.823960] [drm] stuck on render ring
[   63.824130] [drm] GPU crash dump saved to /sys/class/drm/card0/error
[   63.824847] [drm] GPU HANG [e77fffff]
[   63.824893] [drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[   63.824979] [drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[   63.825057] [drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[   63.825138] [drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[   63.825978] [drm] Simulated gpu hang, resetting stop_rings
[   65.805203] [drm] Enabling RC6 states: RC6 on, RC6p off, RC6pp off
[   71.812051] [drm] stuck on blitter ring
[   72.316616] [drm:i915_reset] *ERROR* Failed to reset chip: -110

That shouldn't really happen ...
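
For reading the last log line: on Linux, errno 110 is ETIMEDOUT, which suggests the reset timed out rather than being rejected outright. A minimal sketch of pulling that code out of such a line, so it can be handed to `strerror()`, follows. The helper name `reset_failure_errno` is hypothetical and the marker string is simply copied from the log above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/*
 * Extract the positive errno value from a kernel log line such as
 *   "[drm:i915_reset] *ERROR* Failed to reset chip: -110"
 * Returns 0 when the marker text is missing or the number after it
 * is not negative.
 */
static int reset_failure_errno(const char *line)
{
	static const char marker[] = "Failed to reset chip: ";
	const char *p;
	long code;

	assert(line != NULL);

	p = strstr(line, marker);
	if (!p)
		return 0;

	/* The kernel prints the error as a negative decimal number. */
	code = strtol(p + strlen(marker), NULL, 10);
	return code < 0 ? (int)-code : 0;
}
```

A caller could then print `strerror(reset_failure_errno(line))` to get a readable description of the failure on the machine doing the analysis.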
Comment 5 Mika Kuoppala 2014-02-12 13:33:37 UTC
The two render ring hangs are proper hangs injected by the test, but the
third one, on the blitter ring, is unexpected. Recovery from this hang fails
because attempting to reset the GPU hard-hangs the whole machine.

The SNB I tested with hangs immediately on setting the reset bit.

As the hang recovery works properly with other tests, I will decrease importance.
Comment 6 Daniel Vetter 2014-03-07 20:31:37 UTC
With this patch here

http://patchwork.freedesktop.org/patch/21673/

I can convert the hard system hang into a failed gpu hang. Can you please test this patch and confirm that it improves the situation? gem_reset_stats should still fail, but the machine will survive at least.
Comment 7 lu hua 2014-03-10 06:03:23 UTC
(In reply to comment #6)
> With this patch here
> 
> http://patchwork.freedesktop.org/patch/21673/
> 
> I can convert the hard system hang into a failed gpu hang. Can you please test
> this patch and confirm that it improves the situation? gem_reset_stats should
> still fail, but the machine will survive at least.

Patch fail.
Comment 8 Daniel Vetter 2014-03-10 09:23:49 UTC
On Mon, Mar 10, 2014 at 7:03 AM,  <bugzilla-daemon@freedesktop.org> wrote:
> Patch fail.


Please provide more details on the nature of the failure; this
information is next to useless.
Comment 9 lu hua 2014-03-11 05:51:50 UTC
patching file drivers/gpu/drm/i915/intel_uncore.c
Hunk #1 FAILED at 989.
1 out of 1 hunk FAILED -- saving rejects to file drivers/gpu/drm/i915/intel_uncore.c.rej

drivers/gpu/drm/i915/intel_uncore.c :
        if (fw_engine)
                dev_priv->uncore.funcs.force_wake_get(dev_priv, fw_engine);

        if (IS_GEN6(dev) || IS_GEN7(dev))
                dev_priv->uncore.fifo_count =
                        __raw_i915_read32(dev_priv, GTFIFOCTL) &
                        GT_FIFO_FREE_ENTRIES_MASK;

        spin_unlock_irqrestore(&dev_priv->uncore.lock, irqflags);
        return ret;
}

int intel_gpu_reset(struct drm_device *dev)

patch:
@@ -989,9 +989,11 @@  static int gen6_do_reset(struct drm_device *dev)
 	if (fw_engine)
 		dev_priv->uncore.funcs.force_wake_get(dev_priv, fw_engine);
 
-	if (IS_GEN6(dev) || IS_GEN7(dev))
-		WARN_ON((__raw_i915_read32(dev_priv, GTFIFOCTL) &
-			 GT_FIFO_FREE_ENTRIES_MASK) != 0);
+	if (IS_GEN6(dev) || IS_GEN7(dev)) {
+		if (WARN_ON((__raw_i915_read32(dev_priv, GTFIFOCTL) &
+			     GT_FIFO_FREE_ENTRIES_MASK) != 0))
+		    ret = -EIO;
+	}
 
 	dev_priv->uncore.fifo_count = 0;
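
What the hunk above changes can be sketched in user space: the kernel's WARN_ON() evaluates to its condition, so wrapping it in an `if` lets the same check both log and set an error return. Everything below is a stand-in for illustration: `warn_on_sketch`, `check_fifo_after_reset` and the mask value are assumptions, and the register read is replaced by a plain argument.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdio.h>

/* Assumed mask value, for illustration only; the real one is defined by the driver. */
#define FIFO_FREE_ENTRIES_MASK_SKETCH 0x7fu

/* User-space stand-in for WARN_ON(): log on a hit and hand the condition back. */
static int warn_on_sketch(int cond, const char *expr)
{
	assert(expr != NULL);
	if (cond)
		fprintf(stderr, "WARNING: %s\n", expr);
	return cond;
}

/*
 * Mirrors the shape of the patched hunk: a non-zero masked FIFO value
 * is no longer only warned about, it also turns the result into -EIO.
 * fifo_reg is a plain value here instead of a register read.
 */
static int check_fifo_after_reset(uint32_t fifo_reg, int ret)
{
	if (warn_on_sketch((fifo_reg & FIFO_FREE_ENTRIES_MASK_SKETCH) != 0,
			   "fifo free entries != 0"))
		ret = -EIO;
	return ret;
}
```

With this shape a caller sees the failure as a return code instead of having to notice a warning in the log.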
Comment 10 Daniel Vetter 2014-03-11 10:00:32 UTC
My apologies for not updating this bug, I've already ripped out the offending code. The system hang is still there though.

Re-pinging Mika to just look into disabling this specific subtest on snb.
Comment 11 Daniel Vetter 2014-03-14 22:13:38 UTC
Created attachment 95821 [details] [review]
hang preventer hack

I think I'm onto something here. Please test the attached patch and check whether the hangs disappear. Note that there might be some additional test failures with this (since it rips out a bit of code), the important part is whether the snb still hangs or not.
Comment 12 Daniel Vetter 2014-03-14 23:46:18 UTC
Created attachment 95822 [details] [review]
fix up semaphore hangcheck code

Now also a real patch. Please test both this patch and the earlier quick hack, thanks.
Comment 13 lu hua 2014-03-17 08:26:14 UTC
Created attachment 95913 [details]
dmesg

(In reply to comment #12)
> Created attachment 95822 [details] [review]
> fix up semaphore hangcheck code
> 
> Now also a real patch. Please test both this patch and the earlier quick
> hack, thanks.

Tested this patch; the hang still occurs.
Comment 14 Daniel Vetter 2014-03-26 19:05:47 UTC
Mika, can you please create a patch to just skip the offending tests on snb?
Comment 15 Daniel Vetter 2014-04-29 16:03:10 UTC
Please test the below patch:

http://patchwork.freedesktop.org/patch/25173/
Comment 16 lu hua 2014-04-30 05:20:29 UTC
The hang no longer occurs with the latest -nightly and -fixes kernels. Closing.
output:
IGT-Version: 1.6-gc1404e0 (x86_64) (Linux: 3.15.0-rc2_drm-intel-fixes_7f1950_20140429+ x86_64)
Subtest close-pending-fork-render: SUCCESS
Test requirement not met in function gem_require_ring, file ioctl_wrappers.c:820:
Last errno: 0, Success
Test requirement: (!(gem_has_vebox(fd)))
Comment 17 lu hua 2014-04-30 05:21:00 UTC
Verified. Fixed.
Comment 18 Mika Kuoppala 2014-05-05 11:09:48 UTC
As I failed to trigger the bug without Daniel's patch, I proceeded to bisect for the commit that made the difference. It turned out to be:

commit 5582e8c3c49150c0e7398688b5ed167d6c3d44fd
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Apr 9 09:19:41 2014 +0100

    drm/i915: Preserve ring buffers objects across resume
Comment 19 Jari Tahvanainen 2016-10-12 10:21:33 UTC
Closing verified+fixed.
