| Summary: | [GLK iommu-vs-execlists] GPU HANG: ecode 9:0:0xfffffffe, reason: hang on rcs0, bcs0, vcs0, vecs0 |
|---|---|
| Product: | DRI |
| Reporter: | Domen <domen.stangar> |
| Component: | DRM/Intel |
| Assignee: | Chris Wilson <chris> |
| Status: | RESOLVED WORKSFORME |
| QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
| Severity: | normal |
| Priority: | high |
| CC: | chris, intel-gfx-bugs, yunying.sun |
| Version: | DRI git |
| Hardware: | x86-64 (AMD64) |
| OS: | Linux (All) |
| Whiteboard: | Triaged, ReadyForDev |
| i915 platform: | GLK |
| i915 features: | GPU hang |
Created attachment 139317 [details]
dmesg output
Created attachment 139318 [details]
xrandr output
It appears that 5s after queuing the initial requests, we haven't even submitted them to HW. Quite distressing!
I see you are running with tip, could you please enable CONFIG_DRM_I915_TRACE_GEM=y and apply something like:
diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index df234dc23274..6207bc35a53d 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1840,6 +1840,8 @@ void i915_capture_error_state(struct drm_i915_private *i915,
return;
}
+ GEM_TRACE_DUMP();
+
i915_error_capture_msg(i915, error, engine_mask, error_msg);
DRM_INFO("%s\n", error->error_msg);
Also drm.debug=0xf (everything!) may help try to determine the delay.
Created attachment 139332 [details]
netconsole
Kernel panic, so I had to use netconsole.
Created attachment 139335 [details]
netconsole1
Sorry, I added drm.debug.
It should have dumped the trace to netconsole as well. Could you check the console settings?
So it appears that we wake with a ending CS interrupt before we do anything, or that the CS interrupt is too early. Common suspect in this case is IOMMU, could you try intel_iommu=igfx_off?
Created attachment 139336 [details]
netconsole
When I turned on CONFIG_DRM_I915_TRACE_GEM, it took a different path, so it's not calling GEM_TRACE_DUMP().
I guess it's now calling GEM_BUG_ON(), and not i915_capture_error_state().
GEM_BUG_ON() includes a GEM_TRACE_DUMP; I expect it to show up here :)
Another quick test is:
diff --git a/drivers/gpu/drm/i915/intel_lrc.c b/drivers/gpu/drm/i915/intel_lrc.c
index 9f3cce022b2d..ff179c967e2a 100644
--- a/drivers/gpu/drm/i915/intel_lrc.c
+++ b/drivers/gpu/drm/i915/intel_lrc.c
@@ -1738,6 +1738,8 @@ static void enable_execlists(struct intel_engine_cs *engine)
engine->status_page.ggtt_offset);
POSTING_READ(RING_HWS_PGA(engine->mmio_base));
+ clear_gtiir(engine);
+
/* Following the reset, we need to reload the CSB read/write pointers */
engine->execlists.csb_head = -1;
}
Created attachment 139340 [details]
netconsole
With line added. I guess ftrace works now.
In the intel_iommu=igfx_off test, we got
[ 0.000000] Linux version 4.17.0-rc3+ (root@amd1.blue.org) (gcc version 7.3.0 (GCC)) #5 SMP PREEMPT Fri May 4 08:44:22 CEST 2018
[ 0.000000] Command line: \\k.efi rw drm.debug=0xf intel_iommu=igfx_off initrd=\i.img
...
[ 0.258015] DMAR: No ATSR found
[ 0.258097] DMAR: dmar0: Using Queued invalidation
[ 0.258107] DMAR: dmar1: Using Queued invalidation
[ 0.258195] DMAR: Setting RMRR:
[ 0.258282] DMAR: Setting identity map for device 0000:00:02.0 [0x5f800000 - 0x7fffffff]
...
[ 207.804629] [drm] VT-d active for gfx access
Odd. So we didn't succeed in disabling iommu. Do you mind compiling out iommu entirely to be sure we don't have a problem here with iommu+HWSP?
Created attachment 139341 [details]
netconsole take 4
Now it's a bit different.
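The check made by eye on the igfx_off boot log quoted above (the option is on the command line, yet i915 still reports VT-d as active) can be sketched as a small script. This is only an illustrative helper, not part of the thread or of any kernel tooling: the function name `check_igfx_off` and the sample text are assumptions, and it matches only the two message strings that appear in the quoted log.

```python
import re

def check_igfx_off(dmesg_text):
    """Return (requested, still_active) for a dmesg capture.

    requested    -- intel_iommu=igfx_off appears on the kernel command line
    still_active -- the "[drm] VT-d active for gfx access" message was logged
    (Hypothetical helper; it matches only the strings seen in this report.)
    """
    requested = False
    still_active = False
    for line in dmesg_text.splitlines():
        # Drop a leading "[  0.000000] " style timestamp if present.
        msg = re.sub(r"^\[\s*\d+\.\d+\]\s*", "", line)
        if msg.startswith("Command line:") and "intel_iommu=igfx_off" in msg.split():
            requested = True
        if "VT-d active for gfx access" in msg:
            still_active = True
    return requested, still_active

# Sample lines copied from the log quoted in this report.
sample = r"""[ 0.000000] Command line: \\k.efi rw drm.debug=0xf intel_iommu=igfx_off initrd=\i.img
[ 0.258282] DMAR: Setting identity map for device 0000:00:02.0 [0x5f800000 - 0x7fffffff]
[ 207.804629] [drm] VT-d active for gfx access"""

requested, still_active = check_igfx_off(sample)
if requested and still_active:
    print("igfx_off was requested but VT-d is still active for gfx")
```

A result of `(True, True)` corresponds to the "Odd" case noted above, where the option did not take effect.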
Can you please try re-enabling iommu and https://patchwork.freedesktop.org/patch/221513/ ?
(In reply to Chris Wilson from comment #14)
> Can you please try re-enabling iommu and
> https://patchwork.freedesktop.org/patch/221513/ ?
No, it doesn't help:
GPU HANG: ecode 9:0:0xfffffffe, reason: hang on rcs0, bcs0, vcs0, vecs0, action: reset
Do you need debug enabled and console output?
If you have any other ideas or extra debug flags, let me know.
Created attachment 139587 [details]
dmesg output
Tried on the latest drm-tip.
Yes, please send dmesg with drm.debug=0x1e log_buf_len=4M.
Created attachment 139590 [details]
dmesg
used drm.debug=0x1e log_buf_len=4M
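The two drm.debug values used in this thread (0xf and 0x1e) are bitmasks, so they enable different sets of message categories. The sketch below decodes them, assuming the bit assignments listed in the drm.debug module parameter description for kernels of this era (0x01 core up to 0x80 lease). Treat the table as an assumption and check it against the kernel tree in use; the names `DRM_DEBUG_BITS` and `decode_drm_debug` are made up for illustration.

```python
# Assumed bit-to-category mapping for the drm.debug module parameter
# (verify against the DRM headers of the kernel you are running).
DRM_DEBUG_BITS = {
    0x01: "CORE",
    0x02: "DRIVER",
    0x04: "KMS",
    0x08: "PRIME",
    0x10: "ATOMIC",
    0x20: "VBL",
    0x40: "STATE",
    0x80: "LEASE",
}

def decode_drm_debug(mask):
    """Return the category names enabled by a drm.debug bitmask."""
    return [name for bit, name in sorted(DRM_DEBUG_BITS.items()) if mask & bit]

for value in (0xF, 0x1E):
    print(hex(value), decode_drm_debug(value))
```

Under that assumed mapping, 0x1e drops the core messages that 0xf includes and adds the atomic modesetting ones.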
Chris, any updates on this issue?
Domen, are you still experiencing this issue?
Having purchased a glk (celeron N4100) for myself... This looks to be an isolated incident (well, this and the other glk-iommu reported issues!), as annoyingly it worksforme. I was hoping to be able to reproduce, sorry.
Sorry, we don't use this board anymore, so I don't know if the issue still persists.
Apologies for the disappointing end.
Created attachment 139316 [details]
/sys/class/drm/card0/error
I guess it must be something to do with the detection of monitors.