Summary: | [regression] [bisected] i915 GPU HANG: ecode 7:1:0xfffffffe on Kernel 5.1.x and 5.2rc1 to 5.2rc6 | ||
---|---|---|---|
Product: | DRI | Reporter: | dirkneukirchen |
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Status: | RESOLVED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | major | ||
Priority: | medium | CC: | apreiml, bero, intel-gfx-bugs, jshand2013, mikhail.v.gavrilov, pkozlov.vrn |
Version: | unspecified | ||
Hardware: | x86-64 (AMD64) | ||
OS: | Linux (All) | ||
Whiteboard: | Triaged | ||
i915 platform: | IVB | i915 features: | GPU hang |
Attachments: |
Created attachment 144658 [details]
output of sysfs error file
Created attachment 144659 [details]
output sysfs Kernel 5.2rc5
Created attachment 144660 [details]
output sysfs Kernel 5.2rc4
Created attachment 144661 [details]
output sysfs Kernel 5.2rc3
Created attachment 144662 [details]
output sysfs Kernel 5.1rc1
Ok, we'll do before and after! Hopefully everyone will be happy!
Created attachment 144663 [details]
dmesg error Kernel 5.2rc6
Created attachment 144664 [details]
dmesg error Kernel 5.2rc5
Please try:

diff --git a/drivers/gpu/drm/i915/gt/intel_ringbuffer.c b/drivers/gpu/drm/i915/gt/intel_ringbuffer.c
index 81f9b0422e6a..f11ba6da4d1d 100644
--- a/drivers/gpu/drm/i915/gt/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/gt/intel_ringbuffer.c
@@ -1811,6 +1811,11 @@ static int ring_request_alloc(struct i915_request *request)
 	if (ret)
 		return ret;
 
+	/* Once again for Ivybridge after updating 3D state. */
+	ret = request->engine->emit_flush(request, EMIT_INVALIDATE);
+	if (ret)
+		return ret;
+
 	request->reserved_space -= LEGACY_REQUEST_SIZE;
 	return 0;
 }

Created attachment 144665 [details]
dmesg error Kernel 5.2rc4
Created attachment 144666 [details]
dmesg error Kernel 5.2rc3
Created attachment 144667 [details]
dmesg error Kernel 5.1rc1
Created attachment 144668 [details]
dmesg error Kernel 5.1.12
Created attachment 144669 [details]
output sysfs Kernel 5.1.12
Created attachment 144670 [details]
dmesg error Kernel f2253bd9859b - bad commit in bisect log
Created attachment 144671 [details]
output sysfs Kernel f2253bd9859b - bad commit in bisect log
Waitasec, in upstream, we invalidate before the switch. Could you please check with 5.2 to see if it is already fixed?

See

commit 928f8f42310f244501a7c70daac82c196112c190
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri Apr 19 12:17:47 2019 +0100

    drm/i915/ringbuffer: EMIT_INVALIDATE *before* switch context

    Despite what I think the prm recommends, commit f2253bd9859b
    ("drm/i915/ringbuffer: EMIT_INVALIDATE after switch context") turned out
    to be a huge mistake when enabling Ironlake contexts as the GPU would
    hang on either a MI_FLUSH or PIPE_CONTROL immediately following the
    MI_SET_CONTEXT of an active mesa context (more vanilla contexts, e.g.
    simple rendercopies with igt, do not suffer).

(In reply to Chris Wilson from comment #17)
> Waitasec, in upstream, we invalidate before the switch. Could you please
> check with 5.2 to see if it is already fixed?

It happens in 5.2-rc1 up to the latest 5.2-rc6 too; see the attached dmesg output and sysfs error files. I attached files for 5.1.12 (so after release), for 5.1-rc1 of the 5.1 series, and for various 5.2-rcX kernels, plus a log and sysfs file from the first bad commit I found during the bisect.

PS: drivers/gpu/drm/i915/gt/intel_ringbuffer.c does not exist in Linux 5.2-rc6, so I cannot test that, and the patch does not seem to apply if I try it against the i915/intel_ringbuffer.c that is there.

You need our version of 5.2 ;) https://cgit.freedesktop.org/drm-tip

(In reply to Chris Wilson from comment #20)
> You need our version of 5.2 ;) https://cgit.freedesktop.org/drm-tip

Thank you. I modified my PKGBUILD and created a new kernel pkg with that patch file applied (because the source tree has the patched version). It now reports as 5.2.0-rc6-bisect-g44b3a556c682.

Preliminary: the patch seems to fix the issue; it doesn't occur with that patched kernel variant.

Detail: 4 hours of uptime with a video playback loop in mpv and an active Chromium with YouTube is fine. The error manifested earlier than that in all other cases in my logs, but I will run it a little while longer to be sure. I didn't test the kernel without that patch.

Hello, could you please clarify how much RAM you have in your PC? Asking because it looks like we could also reproduce it, but only after removing 4 GB of memory (4 GB left). On Monday I will check the kernel version suggested by Chris.

(In reply to Denis from comment #22)
> Hello, could you please clarify how much RAM you have in your PC?

8GB RAM. Currently running with this kernel cmdline (also see dmesg):

rw verbose sysrq_always_enabled audit=0 intel_iommu=on,igfx_off

I think an active IOMMU (for VT-d) isn't normally enabled. I am also using 2 monitors (see dmesg):

xrandr --listmonitors
Monitors: 2
 0: +*HDMI-2 1680/474x1050/296+1920+0 HDMI-2
 1: +HDMI-1 1920/521x1080/293+0+0 HDMI-1

> Preliminary: the patch seems to fix the issue

Final: the patch seems to fix the issue; it doesn't occur with that patched kernel variant. After a little more than 13 hours of running video in a loop there is no error related to this bug.

On its way back to v5.1:

commit c84c9029d782a3a0d2a7f0522ecb907314d43e2c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri Apr 19 12:17:47 2019 +0100

    drm/i915/ringbuffer: EMIT_INVALIDATE *before* switch context

    Despite what I think the prm recommends, commit f2253bd9859b
    ("drm/i915/ringbuffer: EMIT_INVALIDATE after switch context") turned out
    to be a huge mistake when enabling Ironlake contexts as the GPU would
    hang on either a MI_FLUSH or PIPE_CONTROL immediately following the
    MI_SET_CONTEXT of an active mesa context (more vanilla contexts, e.g.
    simple rendercopies with igt, do not suffer).

    Ville found the following clue,

      "[DevCTG+]: For the invalidate operation of the pipe control, the
       following pointers are affected. The invalidate operation affects
       the restore of these packets. If the pipe control invalidate
       operation is completed before the context save, the indirect
       pointers will not be restored from memory.
       1. Pipeline State Pointer
       2. Media State Pointer
       3. Constant Buffer Packet"

    which suggests by us emitting the INVALIDATE prior to the
    MI_SET_CONTEXT, we prevent the context-restore from chasing the
    dangling pointers within the image, and explains why this likely
    prevents the GPU hang.

    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190419111749.3910-1-chris@chris-wilson.co.uk
    (cherry picked from commit 928f8f42310f244501a7c70daac82c196112c190 in drm-intel-next)
    Cc: stable@vger.kernel.org
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111014
    Fixes: f2253bd9859b ("drm/i915/ringbuffer: EMIT_INVALIDATE after switch context")
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>

*** Bug 110912 has been marked as a duplicate of this bug. ***
*** Bug 110969 has been marked as a duplicate of this bug. ***
*** Bug 110985 has been marked as a duplicate of this bug. ***
*** Bug 110867 has been marked as a duplicate of this bug. ***
*** Bug 110860 has been marked as a duplicate of this bug. ***
*** Bug 110858 has been marked as a duplicate of this bug. ***
*** Bug 110834 has been marked as a duplicate of this bug. ***
*** Bug 110816 has been marked as a duplicate of this bug. ***
*** Bug 110812 has been marked as a duplicate of this bug. ***
*** Bug 110800 has been marked as a duplicate of this bug. ***
*** Bug 110652 has been marked as a duplicate of this bug. ***

Hi Chris,
I've checked the issues (which you've marked as duplicates) on the 5.2.0 version of drm-tip: the issue isn't reproduced. I've also checked on the previous commit of drm-tip, and the issue does not occur there either. Therefore, the issue was fixed by one of the previous commits, or it's some kind of complex fix.

(In reply to Chris Wilson from comment #18)
> See commit 928f8f42310f244501a7c70daac82c196112c190
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date: Fri Apr 19 12:17:47 2019 +0100
>
> drm/i915/ringbuffer: EMIT_INVALIDATE *before* switch context
>
> Despite what I think the prm recommends, commit f2253bd9859b
> ("drm/i915/ringbuffer: EMIT_INVALIDATE after switch context") turned out
> to be a huge mistake when enabling Ironlake contexts as the GPU would
> hang on either a MI_FLUSH or PIPE_CONTROL immediately following the
> MI_SET_CONTEXT of an active mesa context (more vanilla contexts, e.g.
> simple rendercopies with igt, do not suffer).

This fixes super-annoying hangs on Ivy Bridge with v5.1.15. Thanks for the pointer!
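For readers skimming the thread, the ordering difference between the two commits discussed above can be pictured with a toy model. This is only a sketch, not driver code: the `toy_*` names and the enum are hypothetical, and the check "every context switch is immediately preceded by an invalidate" is a simplification of what the quoted commit message describes.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical command tags for this toy model; not the i915 encodings. */
enum toy_cmd { TOY_INVALIDATE, TOY_SET_CONTEXT };

/* Order described for f2253bd9859b: switch context, then invalidate. */
static size_t toy_emit_invalidate_after(enum toy_cmd *out)
{
	size_t n = 0;

	out[n++] = TOY_SET_CONTEXT;
	out[n++] = TOY_INVALIDATE;
	return n;
}

/* Order described for 928f8f42310f: invalidate, then switch context. */
static size_t toy_emit_invalidate_before(enum toy_cmd *out)
{
	size_t n = 0;

	out[n++] = TOY_INVALIDATE;
	out[n++] = TOY_SET_CONTEXT;
	return n;
}

/*
 * Simplified check (an assumption of this sketch): returns true only if
 * every TOY_SET_CONTEXT is immediately preceded by a TOY_INVALIDATE.
 */
static bool toy_switch_is_guarded(const enum toy_cmd *cmds, size_t n)
{
	for (size_t i = 0; i < n; i++) {
		if (cmds[i] != TOY_SET_CONTEXT)
			continue;
		if (i == 0 || cmds[i - 1] != TOY_INVALIDATE)
			return false;
	}
	return true;
}
```

Calling `toy_switch_is_guarded()` on the output of each emitter (with a buffer of at least two entries) distinguishes the two orderings; it says nothing about why the real hardware hangs.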
Created attachment 144657 [details]
git bisect log

Error Description:
Since kernel 5.1.x I have had several GPU hangs with my hardware, typically when playing video in mpv or using the Chromium browser. A GPU hang results in visible lag, a short freeze, or the desktop UI (KDE) not updating.

Regression because:
The LTS kernel 4.9.x does not have these issues with the same userspace. 5.0.x didn't have these issues either, IIRC.

System Hardware:
- CPU: Intel 3770
- Mainboard: Intel DZ77RE-75K
- Dual Monitor (HDMI and mini-DisplayPort)

OS: Arch Linux with the linux, linux-mainline and linux-lts packages, plus a custom linux-bisect AUR package to test versions locally.

I hope I didn't make an error with the bisection.

Bisect log output -> attachment
Other dmesg/sysfs error txt -> attachment
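When comparing the attached dmesg logs across kernels, it can help to pull the `ecode` triple out of each hang line. The following is a minimal sketch: `parse_hang_ecode` and `struct hang_ecode` are hypothetical names, and the field meanings (GPU generation, engine mask in hex, error hash) are an assumption about the format rather than something stated in this report. The example string is the one from the bug summary; a generation of 7 would be consistent with the IVB platform tag.

```c
#include <stdio.h>
#include <string.h>

/* Fields of "ecode A:B:0xC"; the names are this sketch's assumption. */
struct hang_ecode {
	int gen;              /* assumed: GPU generation */
	unsigned int engines; /* assumed: engine mask, printed in hex */
	unsigned int code;    /* assumed: hash of the error state */
};

/* Returns 0 on success, -1 if the line has no well-formed "ecode" field. */
static int parse_hang_ecode(const char *line, struct hang_ecode *out)
{
	const char *p = strstr(line, "ecode ");

	if (!p)
		return -1;
	if (sscanf(p, "ecode %d:%x:0x%x",
		   &out->gen, &out->engines, &out->code) != 3)
		return -1;
	return 0;
}
```

Feeding it "GPU HANG: ecode 7:1:0xfffffffe" fills the three fields with 7, 0x1 and 0xfffffffe; lines without an `ecode` field return -1.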