Bug 106185

Summary: [CNL-Y] GPU hang
Product: Mesa
Reporter: Timo Aaltonen <tjaalton>
Component: Drivers/DRI/i965
Assignee: Intel 3D Bugs Mailing List <intel-3d-bugs>
Status: RESOLVED MOVED
QA Contact: Intel 3D Bugs Mailing List <intel-3d-bugs>
Severity: normal
Priority: low
CC: intel-gfx-bugs, rafael.antognolli
Version: unspecified
Hardware: Other
OS: All
Whiteboard: Triaged ReadyForDev
i915 platform: CNL
i915 features: GPU hang
Attachments: gpu dump, dmesg with dinq

Description Timo Aaltonen 2018-04-23 09:12:23 UTC
Created attachment 138994
gpu dump

I'm seeing hangs if the GPU is enabled on a CNL-Y SDP (UMCY2SNEP). I've also suspected general HW failure, but after re-seating the CPU and torturing it with stock-ish 4.15 on Ubuntu bionic I'm not able to hang the machine.

Attached is one GPU dump I was able to get with a -nightly kernel from Apr 18. Mesa version is 18.0.0.
Comment 1 Jani Saarinen 2018-04-30 07:13:51 UTC
Have you tried the latest drm-tip (https://cgit.freedesktop.org/drm-tip)? Please also send a dmesg captured from boot with the kernel parameters drm.debug=0x1e log_buf_len=4M.
Comment 2 Timo Aaltonen 2018-05-08 07:44:05 UTC
Yes, the system hang is apparently fixed by:

Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date:   Thu Apr 12 17:58:02 2018 +0300

    drm/i915/cnl: Use mmio access to context status buffer


but I was told the remaining GPU hang was reproduced by your devs, so I'm now waiting for fixes.
Comment 3 Chris Wilson 2018-05-11 15:44:48 UTC
(In reply to Timo Aaltonen from comment #2)
> Yes, the system hang is apparently fixed by:
> 
> Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Date:   Thu Apr 12 17:58:02 2018 +0300
> 
>     drm/i915/cnl: Use mmio access to context status buffer
> 

Now replaced by

commit 77dfedb5be03779f9a5d83e323a1b36e32090105
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri May 11 13:11:45 2018 +0100

    drm/i915/execlists: Use rmb() to order CSB reads
    
    We assume that the CSB is written using the normal ringbuffer
    coherency protocols, as outlined in kernel/events/ring_buffer.c:
    
        *   (HW)                              (DRIVER)
        *
        *   if (LOAD ->data_tail) {            LOAD ->data_head
        *                      (A)             smp_rmb()       (C)
        *      STORE $data                     LOAD $data
        *      smp_wmb()       (B)             smp_mb()        (D)
        *      STORE ->data_head               STORE ->data_tail
        *   }
    
    So we assume that the HW fulfils its ordering requirements (B), and so
    we should use a complementary rmb (C) to ensure that our read of its
    WRITE pointer is completed before we start accessing the data.
    
    The final mb (D) is implied by the uncached mmio we perform to inform
    the HW of our READ pointer.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=105064
    References: https://bugs.freedesktop.org/show_bug.cgi?id=105888
    References: https://bugs.freedesktop.org/show_bug.cgi?id=106185
    Fixes: 767a983ab255 ("drm/i915/execlists: Read the context-status HEAD from the HWSP")
    References: 61bf9719fa17 ("drm/i915/cnl: Use mmio access to context status buffer")
    Suggested-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Michał Winiarski <michal.winiarski@intel.com>
    Cc: Rafael Antognolli <rafael.antognolli@intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Cc: Timo Aaltonen <tjaalton@ubuntu.com>
    Tested-by: Timo Aaltonen <tjaalton@ubuntu.com>
    Acked-by: Michel Thierry <michel.thierry@intel.com>
    Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180511121147.31915-1-chris@chris-wilson.co.uk
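
As an illustration only (this is not the i915 code): the ordering protocol quoted in the commit message above can be sketched in user-space C11. Here a release fence stands in for smp_wmb() (B) and an acquire fence stands in for the rmb (C). The names struct ring, ring_push, ring_pop and RING_SIZE are hypothetical, and the release store on the tail is only an approximation of the full barrier (D), which the commit message says the driver gets from its uncached mmio write.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 8u /* hypothetical size; must be a power of two */
static_assert((RING_SIZE & (RING_SIZE - 1u)) == 0, "RING_SIZE must be a power of two");

/* Hypothetical single-producer/single-consumer ring, not the real CSB layout. */
struct ring {
    uint32_t data[RING_SIZE];
    _Atomic uint32_t head; /* written by the producer (the "HW" column)     */
    _Atomic uint32_t tail; /* written by the consumer (the "DRIVER" column) */
};

/* Producer side. Returns 1 on success, 0 if the ring is full. */
static int ring_push(struct ring *r, uint32_t value)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    /* (A) LOAD ->data_tail; acquire pairs with the consumer's release store */
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == RING_SIZE)
        return 0;

    r->data[head % RING_SIZE] = value;          /* STORE $data            */
    atomic_thread_fence(memory_order_release);  /* (B) smp_wmb() analogue */
    atomic_store_explicit(&r->head, head + 1, memory_order_relaxed);
    return 1;
}

/* Consumer side. Returns 1 and fills *out on success, 0 if the ring is empty. */
static int ring_pop(struct ring *r, uint32_t *out)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    /* LOAD ->data_head: the producer's write pointer */
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);

    if (head == tail)
        return 0;

    /* (C) rmb analogue: the head read must complete before the data read */
    atomic_thread_fence(memory_order_acquire);
    *out = r->data[tail % RING_SIZE];           /* LOAD $data */

    /* (D) approximated by a release store of the new read pointer */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return 1;
}
```

Without the fence at (C), the consumer could in principle see the updated write pointer yet read stale data from the slot, which is the kind of misordering the commit message describes guarding against.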
Comment 4 Jani Saarinen 2018-05-13 07:49:17 UTC
Timo, does this change help any?
Comment 5 Chris Wilson 2018-05-13 08:25:21 UTC
(In reply to Jani Saarinen from comment #4)
> Timo, does this change help any?

The reported GPU hang isn't actually related to the aforementioned patches. It is unclear whether it is caused by a missing workaround (w/a) in the kernel or in Mesa.
Comment 6 Chris Wilson 2018-05-13 08:26:09 UTC
(In reply to Timo Aaltonen from comment #2)
> I was told the remaining GPU hang was reproduced by your devs, so I'm
> now waiting for fixes.

For the record, who? We can ask them for a status update...
Comment 7 Timo Aaltonen 2018-05-30 07:24:48 UTC
Bah, the patch from comment #3 certainly helped, but it seems it's not enough after all. I can still hang the machine, but it takes longer now.
Comment 8 Timo Aaltonen 2018-05-30 13:16:48 UTC
oh well, now with dinq (and not 4.15+backports) it doesn't hard hang anymore, so my backport is somewhat incomplete even with 77dfedb5b added.

GPU hangs/widget corruption happen still, though.
Comment 9 Simon Lee 2018-05-31 16:48:32 UTC
(In reply to Timo Aaltonen from comment #8)
> oh well, now with dinq (and not 4.15+backports) it doesn't hard hang
> anymore, so my backport is somewhat incomplete even with 77dfedb5b added.
> 
> GPU hangs/widget corruption happen still, though.

Do you have an updated log for the GPU hang/widget corruption?
Comment 10 Timo Aaltonen 2018-06-01 09:04:39 UTC
Created attachment 139900
dmesg with dinq

dmesg dump attached

It's easy to reproduce: open gnome-control-center and click through the left-side config panel until the graphics start to get corrupted. After a while there will be a GPU hang and possibly an X server crash.

And apparently, after logging back in, saving the log file to a USB stick and letting the machine idle for a while, it hung hard again. So that's not fixed either :/
Comment 11 Chris Wilson 2018-08-31 13:51:11 UTC
To get some more attention from the people in the know...
Comment 12 GitLab Migration User 2019-09-25 19:10:52 UTC
-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1718.
