Bug 111014 - [regression] [bisected] i915 GPU HANG: ecode 7:1:0xfffffffe on Kernel 5.1.x and 5.2rc1 to 5.2rc6
Status: RESOLVED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: medium major
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: Triaged
Keywords:
Duplicates: 110652 110800 110812 110816 110834 110858 110860 110867 110912 110969 110985
Depends on:
Blocks:
 
Reported: 2019-06-27 10:57 UTC by dirkneukirchen
Modified: 2019-07-04 14:09 UTC (History)
CC List: 6 users

See Also:
i915 platform: IVB
i915 features: GPU hang


Attachments
git bisect log (2.74 KB, text/plain)
2019-06-27 10:57 UTC, dirkneukirchen
output of sysfs error file (31.31 KB, text/plain)
2019-06-27 10:59 UTC, dirkneukirchen
output sysfs Kernel 5.2rc5 (28.38 KB, text/plain)
2019-06-27 10:59 UTC, dirkneukirchen
output sysfs Kernel 5.2rc4 (39.85 KB, text/plain)
2019-06-27 10:59 UTC, dirkneukirchen
output sysfs Kernel 5.2rc3 (32.17 KB, text/plain)
2019-06-27 11:00 UTC, dirkneukirchen
output sysfs Kernel 5.1rc1 (28.93 KB, text/plain)
2019-06-27 11:00 UTC, dirkneukirchen
dmesg error Kernel 5.2rc6 (98.95 KB, text/plain)
2019-06-27 11:02 UTC, dirkneukirchen
dmesg error Kernel 5.2rc5 (98.23 KB, text/x-log)
2019-06-27 11:02 UTC, dirkneukirchen
dmesg error Kernel 5.2rc4 (99.04 KB, text/x-log)
2019-06-27 11:02 UTC, dirkneukirchen
dmesg error Kernel 5.2rc3 (98.78 KB, text/x-log)
2019-06-27 11:03 UTC, dirkneukirchen
dmesg error Kernel 5.1rc1 (98.27 KB, text/plain)
2019-06-27 11:04 UTC, dirkneukirchen
dmesg error Kernel 5.1.12 (99.01 KB, text/plain)
2019-06-27 11:04 UTC, dirkneukirchen
output sysfs Kernel 5.1.12 (29.11 KB, text/plain)
2019-06-27 11:05 UTC, dirkneukirchen
dmesg error Kernel f2253bd9859b - bad commit in bisect log (97.02 KB, text/plain)
2019-06-27 11:07 UTC, dirkneukirchen
output sysfs Kernel f2253bd9859b - bad commit in bisect log (30.11 KB, text/plain)
2019-06-27 11:08 UTC, dirkneukirchen

Description dirkneukirchen 2019-06-27 10:57:25 UTC
Created attachment 144657 [details]
git bisect log

Error Description:
Since kernel 5.1.x I have had several GPU hangs on my hardware,
typically when playing video in mpv or using the Chromium browser.
A GPU hang shows up as visible lag, a short freeze, or the desktop UI (KDE) not updating.

Regression because:
The LTS kernel 4.9.x does not have these issues with the same userspace.
5.0.x didn't have these issues either, IIRC.

System Hardware:

- CPU: Intel 3770 
- Mainboard: Intel DZ77RE-75K
- Dual Monitor (HDMI and mini-Displayport)

OS: Arch Linux 
with the linux, linux-mainline, and linux-lts packages,
plus a custom linux-bisect AUR package to test versions locally

I hope I didn't make an error during the bisection.

Bisect log output -> attachment
Other dmesg/sysfs error text -> attachments
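
A bisect log like the attached one can be summarised mechanically. The following is a minimal Python sketch, assuming the usual `git bisect log` layout in which git records its verdict as a comment line starting with `# first bad commit:`. The function name and the sample log are made up for illustration; only the short hash f2253bd9859b and its subject line come from this report, and the attached log itself is not reproduced here.

```python
import re

# Matches the verdict comment that `git bisect log` records once bisection ends.
FIRST_BAD = re.compile(r"^# first bad commit: \[([0-9a-f]{7,40})\] (.*)$")


def first_bad_commit(log_text):
    """Return (commit_hash, subject) from a bisect log, or None if absent."""
    for line in log_text.splitlines():
        match = FIRST_BAD.match(line.strip())
        if match:
            return match.group(1), match.group(2)
    return None


# Illustrative log only: the good/bad entries are placeholders and the
# hash is abbreviated to the form quoted in this report.
SAMPLE_LOG = """\
git bisect start
# bad: [aaaaaaaaaaaa] placeholder bad revision
# good: [bbbbbbbbbbbb] placeholder good revision
# first bad commit: [f2253bd9859b] drm/i915/ringbuffer: EMIT_INVALIDATE after switch context
"""

print(first_bad_commit(SAMPLE_LOG))
```

Returning `None` when no verdict line is present makes an unfinished bisection easy to tell apart from a completed one.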
Comment 1 dirkneukirchen 2019-06-27 10:59:09 UTC
Created attachment 144658 [details]
output of sysfs error file
Comment 2 dirkneukirchen 2019-06-27 10:59:38 UTC
Created attachment 144659 [details]
output sysfs Kernel 5.2rc5
Comment 3 dirkneukirchen 2019-06-27 10:59:57 UTC
Created attachment 144660 [details]
output sysfs Kernel 5.2rc4
Comment 4 dirkneukirchen 2019-06-27 11:00:18 UTC
Created attachment 144661 [details]
output sysfs Kernel 5.2rc3
Comment 5 dirkneukirchen 2019-06-27 11:00:41 UTC
Created attachment 144662 [details]
output sysfs Kernel 5.1rc1
Comment 6 Chris Wilson 2019-06-27 11:01:15 UTC
Ok, we'll do before and after! Hopefully everyone will be happy!
Comment 7 dirkneukirchen 2019-06-27 11:02:11 UTC
Created attachment 144663 [details]
dmesg error Kernel 5.2rc6
Comment 8 dirkneukirchen 2019-06-27 11:02:31 UTC
Created attachment 144664 [details]
dmesg error Kernel 5.2rc5
Comment 9 Chris Wilson 2019-06-27 11:02:40 UTC
Please try:

diff --git a/drivers/gpu/drm/i915/gt/intel_ringbuffer.c b/drivers/gpu/drm/i915/gt/intel_ringbuffer.c
index 81f9b0422e6a..f11ba6da4d1d 100644
--- a/drivers/gpu/drm/i915/gt/intel_ringbuffer.c
+++ b/drivers/gpu/drm/i915/gt/intel_ringbuffer.c
@@ -1811,6 +1811,11 @@ static int ring_request_alloc(struct i915_request *request)
        if (ret)
                return ret;
 
+       /* Once again for Ivybridge after updating 3D state. */
+       ret = request->engine->emit_flush(request, EMIT_INVALIDATE);
+       if (ret)
+               return ret;
+
        request->reserved_space -= LEGACY_REQUEST_SIZE;
        return 0;
 }
Comment 10 dirkneukirchen 2019-06-27 11:02:58 UTC
Created attachment 144665 [details]
dmesg error Kernel 5.2rc4
Comment 11 dirkneukirchen 2019-06-27 11:03:17 UTC
Created attachment 144666 [details]
dmesg error Kernel 5.2rc3
Comment 12 dirkneukirchen 2019-06-27 11:04:12 UTC
Created attachment 144667 [details]
dmesg error Kernel 5.1rc1
Comment 13 dirkneukirchen 2019-06-27 11:04:40 UTC
Created attachment 144668 [details]
dmesg error Kernel 5.1.12
Comment 14 dirkneukirchen 2019-06-27 11:05:11 UTC
Created attachment 144669 [details]
output sysfs Kernel 5.1.12
Comment 15 dirkneukirchen 2019-06-27 11:07:50 UTC
Created attachment 144670 [details]
dmesg error Kernel f2253bd9859b - bad commit in bisect log
Comment 16 dirkneukirchen 2019-06-27 11:08:20 UTC
Created attachment 144671 [details]
output sysfs Kernel f2253bd9859b - bad commit in bisect log
Comment 17 Chris Wilson 2019-06-27 11:08:35 UTC
Waitasec, in upstream, we invalidate before the switch. Could you please check with 5.2 to see if it is already fixed?
Comment 18 Chris Wilson 2019-06-27 11:10:56 UTC
See commit 928f8f42310f244501a7c70daac82c196112c190
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Apr 19 12:17:47 2019 +0100

    drm/i915/ringbuffer: EMIT_INVALIDATE *before* switch context
    
    Despite what I think the prm recommends, commit f2253bd9859b
    ("drm/i915/ringbuffer: EMIT_INVALIDATE after switch context") turned out
    to be a huge mistake when enabling Ironlake contexts as the GPU would
    hang on either a MI_FLUSH or PIPE_CONTROL immediately following the
    MI_SET_CONTEXT of an active mesa context (more vanilla contexts, e.g.
    simple rendercopies with igt, do not suffer).
Comment 19 dirkneukirchen 2019-06-27 12:13:03 UTC
(In reply to Chris Wilson from comment #17)
> Waitasec, in upstream, we invalidate before the switch. Could you please
> check with 5.2 to see if it is already fixed?

It happens in 5.2-rc1 through the latest 5.2-rc6 too; see the attached dmesg output and sysfs error files.

I attached logs from 5.1.12 (after the 5.1 release) and from 5.1-rc1,
various 5.2-rcX dmesg and sysfs error files,
and a log plus sysfs file from the first bad commit I found during the bisect.

PS: drivers/gpu/drm/i915/gt/intel_ringbuffer.c does not exist in Linux 5.2-rc6, so I cannot test that patch as-is, and it does not seem to apply if I try to patch the i915/intel_ringbuffer.c that is there.
Comment 20 Chris Wilson 2019-06-27 12:19:00 UTC
You need our version of 5.2 ;) https://cgit.freedesktop.org/drm-tip
Comment 21 dirkneukirchen 2019-06-27 19:12:13 UTC
(In reply to Chris Wilson from comment #20)
> You need our version of 5.2 ;) https://cgit.freedesktop.org/drm-tip

Thank you. I modified my PKGBUILD and created a new kernel pkg with that patch file applied (because the source tree has the patched version).

It now reports as 5.2.0-rc6-bisect-g44b3a556c682.

Preliminary: the patch seems to fix the issue; it doesn't occur with that patched kernel variant.

Detail: 4 hours of uptime with a video playback loop in mpv and an active Chromium with YouTube is fine. In all other cases in my logs the error manifested earlier, but I will run it a little while longer to be sure. I didn't test the kernel without that patch.
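
A soak test like this can be backed by scanning saved kernel logs for the hang signature. The sketch below is a minimal Python example that assumes the hang is logged with the text `GPU HANG: ecode` followed by a colon-separated code, as in the bug title. The function name is hypothetical, and the sample line is constructed from the title rather than copied from the attached dmesg files.

```python
import re

# The code after "ecode" is kept as an opaque string; its fields are not interpreted here.
HANG = re.compile(r"GPU HANG: ecode ([0-9A-Fa-fx:]+)")


def find_gpu_hangs(dmesg_text):
    """Return the ecode string of every GPU hang line found in the log text."""
    return HANG.findall(dmesg_text)


# Constructed sample, not real log output.
SAMPLE = "[drm] GPU HANG: ecode 7:1:0xfffffffe\nsome unrelated kernel message\n"

print(find_gpu_hangs(SAMPLE))
```

An empty list after a long run on the patched kernel, compared with a non-empty list on the unpatched one, is the kind of before/after evidence the test above relies on.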
Comment 22 Denis 2019-06-27 22:23:35 UTC
Hello, could you please clarify how much RAM you have in your PC? I'm asking because it looks like we can also reproduce it, but only after removing 4 GB of memory (leaving 4 GB). On Monday we will check the kernel version Chris suggested.
Comment 23 dirkneukirchen 2019-06-28 05:01:14 UTC
(In reply to Denis from comment #22)
> Hello, could you please clarify how much RAM you have in your PC?

8GB RAM

Currently running with this kernel cmdline (also see dmesg): rw verbose sysrq_always_enabled audit=0 intel_iommu=on,igfx_off

I think an active IOMMU (for VT-d) isn't normally enabled.
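
When comparing a reproducing machine against a non-reproducing one, it can help to turn the command line into something that can be diffed. This is a rough Python sketch with hypothetical helper names; it splits on whitespace only, so it assumes no quoted values containing spaces.

```python
def parse_cmdline(cmdline):
    """Split a kernel command line into {name: True | [values]}."""
    params = {}
    for token in cmdline.split():
        name, sep, value = token.partition("=")
        # Bare flags such as "rw" carry no value; record them as True.
        params[name] = value.split(",") if sep else True
    return params


def iommu_enabled(params):
    """True if intel_iommu was given with the 'on' option."""
    value = params.get("intel_iommu")
    return isinstance(value, list) and "on" in value


# Command line quoted in the comment above.
CMDLINE = "rw verbose sysrq_always_enabled audit=0 intel_iommu=on,igfx_off"

print(parse_cmdline(CMDLINE)["intel_iommu"])
print(iommu_enabled(parse_cmdline(CMDLINE)))
```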
Also using 2 monitors (see dmesg) with

xrandr --listmonitors
Monitors: 2
 0: +*HDMI-2 1680/474x1050/296+1920+0  HDMI-2
 1: +HDMI-1 1920/521x1080/293+0+0  HDMI-1


> Preliminary: the patch seems to fix the issue

Final: the patch seems to fix the issue; it doesn't occur with that patched kernel variant.

After a little more than 13 hours of running video in a loop, there is no error related to this bug.
Comment 24 Chris Wilson 2019-07-02 19:21:02 UTC
On its way back to v5.1:

commit c84c9029d782a3a0d2a7f0522ecb907314d43e2c
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Apr 19 12:17:47 2019 +0100

    drm/i915/ringbuffer: EMIT_INVALIDATE *before* switch context
    
    Despite what I think the prm recommends, commit f2253bd9859b
    ("drm/i915/ringbuffer: EMIT_INVALIDATE after switch context") turned out
    to be a huge mistake when enabling Ironlake contexts as the GPU would
    hang on either a MI_FLUSH or PIPE_CONTROL immediately following the
    MI_SET_CONTEXT of an active mesa context (more vanilla contexts, e.g.
    simple rendercopies with igt, do not suffer).
    
    Ville found the following clue,
    
      "[DevCTG+]: For the invalidate operation of the pipe control, the
       following pointers are affected. The
       invalidate operation affects the restore of these packets. If the pipe
       control invalidate operation is completed
       before the context save, the indirect pointers will not be restored from
       memory.
       1. Pipeline State Pointer
       2. Media State Pointer
       3. Constant Buffer Packet"
    
    which suggests by us emitting the INVALIDATE prior to the MI_SET_CONTEXT,
    we prevent the context-restore from chasing the dangling pointers within
    the image, and explains why this likely prevents the GPU hang.
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Ville Syrjälä <ville.syrjala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190419111749.3910-1-chris@chris-wilson.co.uk
    (cherry picked from commit 928f8f42310f244501a7c70daac82c196112c190 in drm-intel-next)
    Cc: stable@vger.kernel.org
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=111014
    Fixes: f2253bd9859b ("drm/i915/ringbuffer: EMIT_INVALIDATE after switch context")
    Signed-off-by: Daniel Vetter <daniel.vetter@ffwll.ch>
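
The ordering argument in the commit message can be pictured with a toy model. The sketch below is Python, not driver code: `build_request` and `invalidate_precedes_switch` are hypothetical names, and the strings only label the two steps named in the commit message (a flush with EMIT_INVALIDATE and the MI_SET_CONTEXT switch). It makes no attempt to model the hardware itself.

```python
INVALIDATE = "EMIT_INVALIDATE"
SWITCH = "MI_SET_CONTEXT"


def build_request(invalidate_before_switch):
    """Return the order in which the two steps are emitted for one request."""
    if invalidate_before_switch:
        # Ordering described by "EMIT_INVALIDATE *before* switch context".
        return [INVALIDATE, SWITCH]
    # Ordering described by "EMIT_INVALIDATE after switch context" (f2253bd9859b).
    return [SWITCH, INVALIDATE]


def invalidate_precedes_switch(commands):
    """True if the invalidate is emitted ahead of the context switch."""
    return commands.index(INVALIDATE) < commands.index(SWITCH)


print(invalidate_precedes_switch(build_request(True)))
print(invalidate_precedes_switch(build_request(False)))
```

In this picture the fix changes only which of the two steps comes first; per the commit message, emitting the invalidate first is what keeps the context restore from following stale pointers.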
Comment 25 Chris Wilson 2019-07-02 19:22:26 UTC
*** Bug 110912 has been marked as a duplicate of this bug. ***
Comment 26 Chris Wilson 2019-07-02 19:22:36 UTC
*** Bug 110969 has been marked as a duplicate of this bug. ***
Comment 27 Chris Wilson 2019-07-02 19:22:52 UTC
*** Bug 110985 has been marked as a duplicate of this bug. ***
Comment 28 Chris Wilson 2019-07-02 19:23:35 UTC
*** Bug 110867 has been marked as a duplicate of this bug. ***
Comment 29 Chris Wilson 2019-07-02 19:23:48 UTC
*** Bug 110860 has been marked as a duplicate of this bug. ***
Comment 30 Chris Wilson 2019-07-02 19:24:01 UTC
*** Bug 110858 has been marked as a duplicate of this bug. ***
Comment 31 Chris Wilson 2019-07-02 19:24:13 UTC
*** Bug 110834 has been marked as a duplicate of this bug. ***
Comment 32 Chris Wilson 2019-07-02 19:25:28 UTC
*** Bug 110816 has been marked as a duplicate of this bug. ***
Comment 33 Chris Wilson 2019-07-02 19:25:44 UTC
*** Bug 110812 has been marked as a duplicate of this bug. ***
Comment 34 Chris Wilson 2019-07-02 19:26:01 UTC
*** Bug 110800 has been marked as a duplicate of this bug. ***
Comment 35 Chris Wilson 2019-07-02 19:26:17 UTC
*** Bug 110652 has been marked as a duplicate of this bug. ***
Comment 36 Paul 2019-07-03 12:42:06 UTC
Hi Chris
I've checked the issues (which you've marked as duplicates) on the 5.2.0 version of drm-tip; the issue does not reproduce.
I've also checked on the previous drm-tip commit, and the issue does not reproduce there either.
Therefore, the issue was fixed by one of the earlier commits, or it's some kind of complex fix.
Comment 37 Dmitry Osipenko 2019-07-04 14:09:03 UTC
(In reply to Chris Wilson from comment #18)
> See commit 928f8f42310f244501a7c70daac82c196112c190
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Fri Apr 19 12:17:47 2019 +0100
> 
>     drm/i915/ringbuffer: EMIT_INVALIDATE *before* switch context
>     
>     Despite what I think the prm recommends, commit f2253bd9859b
>     ("drm/i915/ringbuffer: EMIT_INVALIDATE after switch context") turned out
>     to be a huge mistake when enabling Ironlake contexts as the GPU would
>     hang on either a MI_FLUSH or PIPE_CONTROL immediately following the
>     MI_SET_CONTEXT of an active mesa context (more vanilla contexts, e.g.
>     simple rendercopies with igt, do not suffer).

This fixes super-annoying hangs on Ivy Bridge with v5.1.15. Thanks for the pointer!

