Bug 106385 - [GLK iommu-vs-execlists] GPU HANG: ecode 9:0:0xfffffffe, reason: hang on rcs0, bcs0, vcs0, vecs0
Summary: [GLK iommu-vs-execlists] GPU HANG: ecode 9:0:0xfffffffe, reason: hang on rcs0...
Status: RESOLVED WORKSFORME
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: x86-64 (AMD64) Linux (All)
Importance: high normal
Assignee: Chris Wilson
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged, ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2018-05-03 15:58 UTC by Domen
Modified: 2019-01-17 17:11 UTC
CC List: 3 users

See Also:
i915 platform: GLK
i915 features: GPU hang


Attachments
/sys/class/drm/card0/error (15.25 KB, text/plain)
2018-05-03 15:58 UTC, Domen
dmesg output (54.56 KB, text/plain)
2018-05-03 15:59 UTC, Domen
xrandr output (1.14 KB, text/plain)
2018-05-03 15:59 UTC, Domen
netconsole (50.27 KB, text/plain)
2018-05-04 07:04 UTC, Domen
netconsole1 (38.38 KB, text/plain)
2018-05-04 07:14 UTC, Domen
netconsole (77.63 KB, text/plain)
2018-05-04 07:34 UTC, Domen
netconsole (54.04 KB, text/plain)
2018-05-04 07:59 UTC, Domen
netconsole take 4 (176.60 KB, text/plain)
2018-05-04 09:03 UTC, Domen
dmesg output (61.09 KB, text/plain)
2018-05-16 07:03 UTC, Domen
dmesg (189.10 KB, text/plain)
2018-05-16 11:22 UTC, Domen

Description Domen 2018-05-03 15:58:10 UTC
Created attachment 139316 [details]
/sys/class/drm/card0/error

I guess it must be something to do with the detection of monitors.
Comment 1 Domen 2018-05-03 15:59:07 UTC
Created attachment 139317 [details]
dmesg output
Comment 2 Domen 2018-05-03 15:59:27 UTC
Created attachment 139318 [details]
xrandr output
Comment 3 Chris Wilson 2018-05-03 16:13:24 UTC
It appears that 5s after queuing the initial requests, we haven't even submitted them to HW. Quite distressing!

I see you are running with tip, could you please enable CONFIG_DRM_I915_TRACE_GEM=y and apply something like:

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index df234dc23274..6207bc35a53d 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1840,6 +1840,8 @@ void i915_capture_error_state(struct drm_i915_private *i915,
                return;
        }
 
+       GEM_TRACE_DUMP();
+
        i915_error_capture_msg(i915, error, engine_mask, error_msg);
        DRM_INFO("%s\n", error->error_msg);
Comment 4 Chris Wilson 2018-05-03 16:15:50 UTC
Also, drm.debug=0xf (everything!) may help to determine the delay.
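
For reference, the drm.debug values used in this thread (0xf here, 0x1e in a later comment) are bitmasks of DRM debug categories. A minimal sketch of decoding them follows. It assumes the category bits as defined in the kernel's include/drm/drm_print.h around v4.17 (newer kernels define further bits, so check your own tree), and drm_debug_describe() is a hypothetical helper written for illustration, not a kernel function.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* drm.debug category bits, assumed to match the kernel headers of the
 * v4.17 era; later kernels add more categories. */
struct drm_debug_bit {
	unsigned int mask;
	const char *name;
};

static const struct drm_debug_bit drm_debug_bits[] = {
	{ 0x01, "CORE" },
	{ 0x02, "DRIVER" },
	{ 0x04, "KMS" },
	{ 0x08, "PRIME" },
	{ 0x10, "ATOMIC" },
	{ 0x20, "VBL" },
	{ 0x40, "STATE" },
	{ 0x80, "LEASE" },
};

/* Hypothetical helper: write the '|'-separated names of the categories
 * set in value into buf (at most len bytes) and return buf. */
static char *drm_debug_describe(unsigned int value, char *buf, size_t len)
{
	size_t i, used = 0;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';

	for (i = 0; i < sizeof(drm_debug_bits) / sizeof(drm_debug_bits[0]); i++) {
		int n;

		if (!(value & drm_debug_bits[i].mask))
			continue;

		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", drm_debug_bits[i].name);
		if (n < 0 || (size_t)n >= len - used) {
			buf[used] = '\0'; /* drop the partial name on truncation */
			break;
		}
		used += (size_t)n;
	}
	return buf;
}
```

Under those assumptions, drm_debug_describe(0xf, ...) yields CORE|DRIVER|KMS|PRIME and drm_debug_describe(0x1e, ...) yields DRIVER|KMS|PRIME|ATOMIC.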
Comment 5 Domen 2018-05-04 07:04:03 UTC
Created attachment 139332 [details]
netconsole

Kernel panic, so I had to use netconsole.
Comment 6 Domen 2018-05-04 07:14:37 UTC
Created attachment 139335 [details]
netconsole1

Sorry, I added drm.debug.
Comment 7 Chris Wilson 2018-05-04 07:18:10 UTC
It should have dumped the trace to netconsole as well. Could you check the console settings?

So it appears that we wake with a pending CS interrupt before we do anything, or that the CS interrupt is too early. A common suspect in this case is the IOMMU; could you try intel_iommu=igfx_off?
Comment 8 Domen 2018-05-04 07:34:14 UTC
Created attachment 139336 [details]
netconsole

When I turned on CONFIG_DRM_I915_TRACE_GEM, it took a different path, so it's not calling GEM_TRACE_DUMP().
I guess it's now calling GEM_BUG_ON() and not i915_capture_error_state().
Comment 9 Chris Wilson 2018-05-04 07:41:02 UTC
GEM_BUG_ON() includes a GEM_TRACE_DUMP; I expect it to show up here :)
Comment 10 Chris Wilson 2018-05-04 07:44:41 UTC
Another quick test is:

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 9f3cce022b2d..ff179c967e2a 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1738,6 +1738,8 @@ static void enable_execlists(struct intel_engine_cs *engine)
                   engine->status_page.ggtt_offset);
        POSTING_READ(RING_HWS_PGA(engine->mmio_base));
 
+       clear_gtiir(engine);
+
        /* Following the reset, we need to reload the CSB read/write pointers */
        engine->execlists.csb_head = -1;
 }
Comment 11 Domen 2018-05-04 07:59:53 UTC
Created attachment 139340 [details]
netconsole

With the line added. I guess ftrace works now.
Comment 12 Chris Wilson 2018-05-04 08:45:09 UTC
In the intel_iommu=igfx_off test, we got

[    0.000000] Linux version 4.17.0-rc3+ (root@amd1.blue.org) (gcc version 7.3.0 (GCC)) #5 SMP PREEMPT Fri May 4 08:44:22 CEST 2018
[    0.000000] Command line: \\k.efi rw drm.debug=0xf intel_iommu=igfx_off initrd=\i.img
...
[    0.258015] DMAR: No ATSR found
[    0.258097] DMAR: dmar0: Using Queued invalidation
[    0.258107] DMAR: dmar1: Using Queued invalidation
[    0.258195] DMAR: Setting RMRR:
[    0.258282] DMAR: Setting identity map for device 0000:00:02.0 [0x5f800000 - 0x7fffffff]
...
[  207.804629] [drm] VT-d active for gfx access

Odd. So we didn't succeed in disabling iommu. Do you mind compiling out iommu entirely to be sure we don't have a problem here with iommu+HWSP?
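
The mismatch described above (intel_iommu=igfx_off on the command line, yet "VT-d active for gfx access" later in the log) can be checked mechanically. The following is only a rough sketch that matches the two strings quoted in this comment; check_iommu_log() and iommu_disable_failed() are hypothetical helper names, and other kernel versions may word these messages differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

struct iommu_check {
	bool igfx_off_requested; /* "intel_iommu=igfx_off" on the "Command line:" line */
	bool vtd_active_for_gfx; /* log contains "VT-d active for gfx access" */
};

/* Hypothetical helper: scan a captured dmesg/netconsole log held in memory. */
static struct iommu_check check_iommu_log(const char *log)
{
	struct iommu_check r = { false, false };
	const char *cmd;

	assert(log != NULL);

	cmd = strstr(log, "Command line:");
	if (cmd) {
		/* Only accept a match on the command-line line itself. */
		const char *eol = strchr(cmd, '\n');
		const char *hit = strstr(cmd, "intel_iommu=igfx_off");

		r.igfx_off_requested = hit && (!eol || hit < eol);
	}

	r.vtd_active_for_gfx =
		strstr(log, "VT-d active for gfx access") != NULL;
	return r;
}

/* True when igfx_off was requested but the driver still reports VT-d
 * as active for graphics, which is the situation seen in this comment. */
static bool iommu_disable_failed(const char *log)
{
	struct iommu_check r = check_iommu_log(log);

	return r.igfx_off_requested && r.vtd_active_for_gfx;
}
```

Run against the log excerpt above, iommu_disable_failed() would return true, which is why compiling the IOMMU out entirely is the more reliable test.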
Comment 13 Domen 2018-05-04 09:03:09 UTC
Created attachment 139341 [details]
netconsole take 4

Now it's a bit different.
Comment 14 Chris Wilson 2018-05-09 21:17:05 UTC
Can you please try re-enabling iommu and https://patchwork.freedesktop.org/patch/221513/ ?
Comment 15 Domen 2018-05-09 21:41:30 UTC
(In reply to Chris Wilson from comment #14)
> Can you please try re-enabling iommu and
> https://patchwork.freedesktop.org/patch/221513/ ?

No, it doesn't help:
GPU HANG: ecode 9:0:0xfffffffe, reason: hang on rcs0, bcs0, vcs0, vecs0, action: reset

Do you need debug enabled and console output?
If you have any other ideas or extra debug flags, let me know.
Comment 16 Domen 2018-05-16 07:03:26 UTC
Created attachment 139587 [details]
dmesg output

Tried on the latest drm-tip.
Comment 17 Jani Saarinen 2018-05-16 09:35:54 UTC
Yes, please send dmesg with drm.debug=0x1e log_buf_len=4M.
Comment 18 Domen 2018-05-16 11:22:29 UTC
Created attachment 139590 [details]
dmesg

Used drm.debug=0x1e log_buf_len=4M.
Comment 19 Lakshmi 2018-09-10 13:37:10 UTC
Chris, any updates on this issue?
Comment 20 Francesco Balestrieri 2019-01-09 08:46:17 UTC
Domen, are you still experiencing this issue?
Comment 21 Chris Wilson 2019-01-17 09:33:04 UTC
Having purchased a glk (celeron N4100) for myself... This looks to be an isolated incident (well, this and the other glk-iommu reported issues!), as annoyingly it worksforme. I was hoping to be able to reproduce it, sorry.
Comment 22 Domen 2019-01-17 17:09:32 UTC
Sorry, we don't use this board anymore, so I don't know if the issue still persists.
Comment 23 Chris Wilson 2019-01-17 17:11:47 UTC
Apologies for the disappointing end.

