Bug 89001 - [SKL]Time out and system reboot fails while running IGT cases: gem_ringfill/render, gem_ringfill/render-interruptible
Status: CLOSED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: All Linux (All)
Importance: high major
Assignee: cprigent
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard:
Keywords:
Depends on:
Blocks:
 
Reported: 2015-02-06 05:20 UTC by fangxun
Modified: 2015-11-14 10:52 UTC

See Also:
i915 platform: SKL
i915 features: GEM/Other


Attachments
dmes file (124.84 KB, text/plain)
2015-02-06 05:20 UTC, fangxun

Description fangxun 2015-02-06 05:20:35 UTC
Created attachment 113215
dmes file

==System Environment==
--------------------------
Regression: not sure

Non-working platforms: SKL

==kernel==
--------------------------
drm-intel-nightly/9583cb

==Bug detailed description==
-----------------------------
The IGT subtests gem_ringfill/render and gem_ringfill/render-interruptible time out, and the system then fails to reboot.


Reproduce Steps
==============
./gem_ringfill --run-subtest render
./gem_ringfill --run-subtest render-interruptible
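
Since these subtests may never return, it can help to launch them under a watchdog so that a timeout is at least recorded. The following is a minimal sketch in C, assuming a POSIX system; `run_with_timeout()` is a hypothetical helper and not part of IGT. It can only detect a test that fails to finish; it cannot recover a machine that has hung hard.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: run argv[0] with its arguments.
 * Returns the child's exit status (0..255) if it exits normally,
 * -1 on a fork/wait error or abnormal termination, and
 * -2 if the child was still running after timeout_s seconds and was killed.
 */
static int run_with_timeout(char *const argv[], int timeout_s)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                 /* only reached if exec failed */
    }

    /* Poll the child every 100 ms until it exits or the budget runs out. */
    struct timespec tick = { 0, 100 * 1000 * 1000 };
    for (int ticks = 0; ticks < timeout_s * 10; ticks++) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        if (r < 0 && errno != EINTR)
            return -1;
        nanosleep(&tick, NULL);
    }

    /* Timed out. A child stuck in uninterruptible sleep may not die here. */
    kill(pid, SIGKILL);
    waitpid(pid, NULL, 0);
    return -2;
}
```

For example, `run_with_timeout((char *[]){"./gem_ringfill", "--run-subtest", "render", NULL}, 120)` would return -2 if the subtest has not finished after two minutes; the 120-second limit is an arbitrary choice for illustration.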
Comment 1 Jesse Barnes 2015-03-10 20:23:52 UTC
Michel, have you seen this one?  It's hard to capture logs since the system hangs pretty hard, but one log I did capture showed a bad I/O access in the iowrite32 in intel_logical_ring_emit(), which sent me searching for our virtual_start mapping setup.  That led me to something like this:

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index fcb074b..bc97457 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -504,8 +504,11 @@ static int execlists_context_queue(struct intel_engine_cs *
        unsigned long flags;
        int num_elements = 0;
 
-       if (to != ring->default_context)
-               intel_lr_context_pin(ring, to);
+       if (to != ring->default_context) {
+               ret = intel_lr_context_pin(ring, to);
+               if (ret)
+                       return ret;
+       }
 
        if (!request) {
                /*
@@ -802,13 +805,16 @@ intel_logical_ring_advance_and_submit(struct intel_ringbuf
                                      struct drm_i915_gem_request *request)
 {
        struct intel_engine_cs *ring = ringbuf->ring;
+       int ret;
 
        intel_logical_ring_advance(ringbuf);
 
        if (intel_ring_stopped(ring))
                return;
 
-       execlists_context_queue(ring, ctx, ringbuf->tail, request);
+       ret = execlists_context_queue(ring, ctx, ringbuf->tail, request);
+       if (ret)
+               DRM_ERROR("execlist context queue failed: %d\n", ret);
 }
 
 static int intel_lr_context_pin(struct intel_engine_cs *ring,

but that's not sufficient to fix this bug.  It does seem important that we check these return values though.
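
The return-value checking discussed here can be illustrated outside the kernel. This is a minimal user-space sketch, not i915 code: `struct ctx`, `ctx_pin()`, `ctx_queue()` and `advance_and_submit()` are hypothetical stand-ins for the functions touched by the diff, and the pin failure is simulated with a flag.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-in for an execlists context; not the i915 type. */
struct ctx {
    int is_default;   /* the default context is not pinned on queue */
    int fail_pin;     /* set to simulate a pin failure */
    int pin_count;
    int queued;
};

/* Pin the context; returns 0 on success or a negative errno. */
static int ctx_pin(struct ctx *c)
{
    if (c->fail_pin)
        return -ENOMEM;
    c->pin_count++;
    return 0;
}

/* Queue the context, propagating a pin failure instead of ignoring it. */
static int ctx_queue(struct ctx *c)
{
    int ret;

    if (!c->is_default) {
        ret = ctx_pin(c);
        if (ret)
            return ret;   /* never queue a context that was not pinned */
    }
    c->queued = 1;
    return 0;
}

/* Remembers the most recent error reported by advance_and_submit(). */
static int last_logged_error;

/* A void caller cannot propagate the error, so it logs it instead. */
static void advance_and_submit(struct ctx *c)
{
    int ret = ctx_queue(c);

    if (ret) {
        last_logged_error = ret;
        fprintf(stderr, "context queue failed: %d\n", ret);
    }
}
```

With `fail_pin` set on a non-default context, `ctx_queue()` returns `-ENOMEM` and leaves `queued` at 0, whereas the unchecked version would have gone on to queue a context that was never pinned.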

And this failure may indicate something wrong with the lrc handling code, I'm not sure.  Some additional, custom kernel debug code would probably help narrow things down.
Comment 2 Michel Thierry 2015-03-11 10:06:14 UTC
Those tests pass on BDW, so there must be something we need to change for SKL. I'll try to find an SKL machine in the office.
Comment 3 Tvrtko Ursulin 2015-03-11 17:04:49 UTC
Command submission hang with "reset button does not work" is something I've been experiencing "forever" on my SKL.

In my case the reset button actually works, but with a ~20 second delay (same with power off).

I was reproducing it with gem_exec_nop, and also with any other submission, though much less frequently. So any IGT test can hang, since each one does a submission on startup.

I was able to get occasional lockdep traces over serial when it happens, but extremely rarely, and they would point to seemingly impossible locking scenarios. Can try and dig them out if we think it is the same bug.
Comment 4 Chris Harris 2015-03-13 18:28:04 UTC
This may be a duplicate of https://bugs.freedesktop.org/show_bug.cgi?id=88865 and may be fixed by the 'OLR removal' patch set.
Comment 5 Jesse Barnes 2015-07-29 14:35:55 UTC
Assigning to QA for duplication; could be fixed already or hidden by the ringfill hard hangs #90854.
Comment 6 Humberto Israel Perez Rodriguez 2015-11-13 16:18:43 UTC
I tested the following test cases with drm-intel-testing and drm-intel-nightly; the tests passed on SKL-Y with both kernels.

Test cases:
./gem_ringfill --run-subtest render
./gem_ringfill --run-subtest render-interruptible


Kernel : latest drm-intel-nightly: 2015y-11m-06d-12h-48m-02s UTC integration manifest
commit a3b0dec82fdb59c629c4fb9847245b80b0cf69dd
Author: Jani Nikula <jani.nikula@intel.com>
Date:   Fri Nov 6 14:48:23 2015 +0200

Kernel : latest drm-intel-testing (4.3.0-rc6-testing)
commit 87074657f22e38163e712ca417e1a398d00096b6
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Fri Oct 23 11:56:52 2015 +0200


Software configuration :
--------------------------------
Ubuntu 14.04.03 x86_64
Xserver : 1.17.4  (commit : 2c7fa2a)
libdrm : 2.4.65 (commit : c349616)
Xf86-video-intel : 2.99.917 (commit : baec802)
Mesa : 11.0.4 (commit : 31bf247)
Libva : 1.6.1 (commit : 613eb96)
Intel-driver : 1.6.1 (commit : 35858c6)
Cairo : 1.14.4 (commit : 0317ee7)


 --- Hardware information ---
CPU information : Intel(R) Core(TM) m5-6Y57 CPU @ 1.10GHz
GPU Card  : Intel Corporation Device 191e (rev 07) (prog-if 00 [VGA controller])
Bios    : 102.0
KSC   : 1.15
Memory ram  : 4 GB



I will therefore close this bug as fixed. If the issue shows up again, please reopen it.
Comment 7 cprigent 2015-11-14 10:52:16 UTC
So closed

