106385 – [GLK iommu-vs-execlists] GPU HANG: ecode 9:0:0xfffffffe, reason: hang on rcs0, bcs0, vcs0, vecs0

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	DRI git
Hardware:	x86-64 (AMD64) Linux (All)

Importance:	high normal
Assignee:	Chris Wilson
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	Triaged, ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2018-05-03 15:58 UTC by Domen
Modified:	2019-01-17 17:11 UTC
CC List:	3 users

See Also:
i915 platform:	GLK
i915 features:	GPU hang

Attachments
/sys/class/drm/card0/error (15.25 KB, text/plain) 2018-05-03 15:58 UTC, Domen
dmesg output (54.56 KB, text/plain) 2018-05-03 15:59 UTC, Domen
xrandr output (1.14 KB, text/plain) 2018-05-03 15:59 UTC, Domen
netconsole (50.27 KB, text/plain) 2018-05-04 07:04 UTC, Domen
netconsole1 (38.38 KB, text/plain) 2018-05-04 07:14 UTC, Domen
netconsole (77.63 KB, text/plain) 2018-05-04 07:34 UTC, Domen
netconsole (54.04 KB, text/plain) 2018-05-04 07:59 UTC, Domen
netconsole take 4 (176.60 KB, text/plain) 2018-05-04 09:03 UTC, Domen
dmesg output (61.09 KB, text/plain) 2018-05-16 07:03 UTC, Domen
dmesg (189.10 KB, text/plain) 2018-05-16 11:22 UTC, Domen

Description Domen 2018-05-03 15:58:10 UTC

Created attachment 139316 [details]
/sys/class/drm/card0/error

I guess it must be something to do with the detection of monitors.

Comment 1 Domen 2018-05-03 15:59:07 UTC

Created attachment 139317 [details]
dmesg output

Comment 2 Domen 2018-05-03 15:59:27 UTC

Created attachment 139318 [details]
xrandr output

Comment 3 Chris Wilson 2018-05-03 16:13:24 UTC

It appears that 5s after queuing the initial requests, we haven't even submitted them to HW. Quite distressing!

I see you are running with tip, could you please enable CONFIG_DRM_I915_TRACE_GEM=y and apply something like:

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index df234dc23274..6207bc35a53d 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1840,6 +1840,8 @@ void i915_capture_error_state(struct drm_i915_private *i915,
                return;
        }
 
+       GEM_TRACE_DUMP();
+
        i915_error_capture_msg(i915, error, engine_mask, error_msg);
        DRM_INFO("%s\n", error->error_msg);

Comment 4 Chris Wilson 2018-05-03 16:15:50 UTC

Also drm.debug=0xf (everything!) may help try to determine the delay.
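
For readers unfamiliar with the drm.debug bitmask, here is a minimal Python sketch that decodes a mask into category names. The bit assignments are an assumption based on the DRM_UT_* categories in kernels of roughly this vintage (4.17); later kernels define additional bits, and the helper name decode_drm_debug is purely illustrative.

```python
# Assumed drm.debug bit layout (DRM_UT_* categories, circa Linux 4.17).
# Later kernels add more categories; treat this table as illustrative.
DRM_DEBUG_BITS = {
    0x01: "core",
    0x02: "driver",
    0x04: "kms",
    0x08: "prime",
    0x10: "atomic",
    0x20: "vbl",
    0x40: "state",
    0x80: "lease",
}


def decode_drm_debug(mask):
    """Return the names of the debug categories enabled by a drm.debug mask."""
    return [name for bit, name in DRM_DEBUG_BITS.items() if mask & bit]


if __name__ == "__main__":
    for mask in (0xF, 0x1E):
        print(hex(mask), decode_drm_debug(mask))
```

Under that assumed layout, 0xf covers the core, driver, kms and prime categories, while the 0x1e requested later in this thread swaps core for atomic.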

Comment 5 Domen 2018-05-04 07:04:03 UTC

Created attachment 139332 [details]
netconsole

Kernel panic, so I had to use netconsole.

Comment 6 Domen 2018-05-04 07:14:37 UTC

Created attachment 139335 [details]
netconsole1

Sorry, I added drm.debug.

Comment 7 Chris Wilson 2018-05-04 07:18:10 UTC

It should have dumped the trace to netconsole as well. Could you check the console settings?

So it appears that we wake with a pending CS interrupt before we do anything, or that the CS interrupt is too early. The common suspect in this case is the IOMMU; could you try intel_iommu=igfx_off?

Comment 8 Domen 2018-05-04 07:34:14 UTC

Created attachment 139336 [details]
netconsole

When I turned on CONFIG_DRM_I915_TRACE_GEM, it took a different path, so it's not calling GEM_TRACE_DUMP().
I guess it's now calling GEM_BUG_ON(), and not i915_capture_error_state().

Comment 9 Chris Wilson 2018-05-04 07:41:02 UTC

GEM_BUG_ON() includes a GEM_TRACE_DUMP; I expect it to show up here :)

Comment 10 Chris Wilson 2018-05-04 07:44:41 UTC

Another quick test is:

diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 9f3cce022b2d..ff179c967e2a 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1738,6 +1738,8 @@ static void enable_execlists(struct intel_engine_cs *engine)
                   engine->status_page.ggtt_offset);
        POSTING_READ(RING_HWS_PGA(engine->mmio_base));
 
+       clear_gtiir(engine);
+
        /* Following the reset, we need to reload the CSB read/write pointers */
        engine->execlists.csb_head = -1;
 }

Comment 11 Domen 2018-05-04 07:59:53 UTC

Created attachment 139340 [details]
netconsole

With the line added. I guess ftrace works now.

Comment 12 Chris Wilson 2018-05-04 08:45:09 UTC

In the intel_iommu=igfx_off test, we got

[    0.000000] Linux version 4.17.0-rc3+ (root@amd1.blue.org) (gcc version 7.3.0 (GCC)) #5 SMP PREEMPT Fri May 4 08:44:22 CEST 2018
[    0.000000] Command line: \\k.efi rw drm.debug=0xf intel_iommu=igfx_off initrd=\i.img
...
[    0.258015] DMAR: No ATSR found
[    0.258097] DMAR: dmar0: Using Queued invalidation
[    0.258107] DMAR: dmar1: Using Queued invalidation
[    0.258195] DMAR: Setting RMRR:
[    0.258282] DMAR: Setting identity map for device 0000:00:02.0 [0x5f800000 - 0x7fffffff]
...
[  207.804629] [drm] VT-d active for gfx access

Odd. So we didn't succeed in disabling iommu. Do you mind compiling out iommu entirely to be sure we don't have a problem here with iommu+HWSP?
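
The check above (igfx_off on the command line, yet "VT-d active for gfx access" still logged) can be automated against a saved dmesg or netconsole dump. The following is only a sketch: the function name iommu_status and the sample text are illustrative, and it relies on nothing beyond the log strings quoted in this comment.

```python
import re

# Sample lines copied from the log excerpt quoted in this comment.
SAMPLE = r"""
[    0.000000] Command line: \\k.efi rw drm.debug=0xf intel_iommu=igfx_off initrd=\i.img
[    0.258015] DMAR: No ATSR found
[    0.258282] DMAR: Setting identity map for device 0000:00:02.0 [0x5f800000 - 0x7fffffff]
[  207.804629] [drm] VT-d active for gfx access
"""


def iommu_status(dmesg_text):
    """Return (igfx_off_requested, vtd_active_for_gfx) for a dmesg dump."""
    igfx_off_requested = False
    vtd_active = False
    for line in dmesg_text.splitlines():
        # Drop the "[  seconds.micros] " timestamp prefix if present.
        msg = re.sub(r"^\[\s*\d+\.\d+\]\s*", "", line)
        if msg.startswith("Command line:"):
            for token in msg[len("Command line:"):].split():
                # intel_iommu= accepts a comma-separated option list.
                if token.startswith("intel_iommu="):
                    options = token.split("=", 1)[1].split(",")
                    igfx_off_requested = "igfx_off" in options
        elif "VT-d active for gfx access" in msg:
            vtd_active = True
    return igfx_off_requested, vtd_active


if __name__ == "__main__":
    requested, active = iommu_status(SAMPLE)
    if requested and active:
        print("igfx_off was requested but VT-d is still active for gfx")
```

A result of (True, True) corresponds to the situation described here: the option was passed but did not take effect.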

Comment 13 Domen 2018-05-04 09:03:09 UTC

Created attachment 139341 [details]
netconsole take 4

Now it's a bit different.

Comment 14 Chris Wilson 2018-05-09 21:17:05 UTC

Can you please try re-enabling iommu and https://patchwork.freedesktop.org/patch/221513/ ?

Comment 15 Domen 2018-05-09 21:41:30 UTC

(In reply to Chris Wilson from comment #14)
> Can you please try re-enabling iommu and
> https://patchwork.freedesktop.org/patch/221513/ ?

No, it doesn't help:
GPU HANG: ecode 9:0:0xfffffffe, reason: hang on rcs0, bcs0, vcs0, vecs0, action: reset

Do you need debug enabled and console output?
If you have any other ideas or extra debug flags, let me know.
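
When comparing hang reports across several logs, it can help to split the "GPU HANG" line into fields. The sketch below is a hypothetical helper, not part of any i915 tooling. It assumes the three ecode fields are GPU generation, engine id and error code, which is a reading of the message format rather than something stated in this thread.

```python
import re

# Assumed layout: "ecode <gen>:<engine id>:<error code>, reason: ..., action: ..."
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<gen>\d+):(?P<engine>-?\d+):(?P<ecode>0x[0-9a-fA-F]+)"
    r", reason: (?P<reason>.*?)(?:, action: (?P<action>\w+))?$"
)


def parse_gpu_hang(line):
    """Parse one 'GPU HANG' line into a dict, or return None if it does not match."""
    match = HANG_RE.search(line.strip())
    if match is None:
        return None
    reason = match.group("reason")
    engines = []
    if reason.startswith("hang on "):
        engines = [e.strip() for e in reason[len("hang on "):].split(",")]
    return {
        "gen": int(match.group("gen")),
        "engine": int(match.group("engine")),
        "ecode": int(match.group("ecode"), 16),
        "engines": engines,
        "action": match.group("action"),  # None when the line has no action
    }


if __name__ == "__main__":
    line = ("GPU HANG: ecode 9:0:0xfffffffe, reason: hang on "
            "rcs0, bcs0, vcs0, vecs0, action: reset")
    print(parse_gpu_hang(line))
```

For the line reported above, this yields all four engines (rcs0, bcs0, vcs0, vecs0) and the action "reset".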

Comment 16 Domen 2018-05-16 07:03:26 UTC

Created attachment 139587 [details]
dmesg output

Tried on the latest drm-tip.

Comment 17 Jani Saarinen 2018-05-16 09:35:54 UTC

Yes, please send dmesg with drm.debug=0x1e log_buf_len=4M.

Comment 18 Domen 2018-05-16 11:22:29 UTC

Created attachment 139590 [details]
dmesg

Used drm.debug=0x1e log_buf_len=4M.

Comment 19 Lakshmi 2018-09-10 13:37:10 UTC

Chris, any updates on this issue?

Comment 20 Francesco Balestrieri 2019-01-09 08:46:17 UTC

Domen, are you still experiencing this issue?

Comment 21 Chris Wilson 2019-01-17 09:33:04 UTC

Having purchased a glk (Celeron N4100) for myself... This looks to be an isolated incident (well, this and the other glk-iommu reported issues!), as annoyingly it worksforme. I was hoping to be able to reproduce it, sorry.

Comment 22 Domen 2019-01-17 17:09:32 UTC

Sorry, we don't use this board anymore, so I don't know if the issue still persists.

Comment 23 Chris Wilson 2019-01-17 17:11:47 UTC

Apologies for the disappointing end.
