106185 – [CNL-Y] GPU hang

Summary: [CNL-Y] GPU hang

Status:	RESOLVED MOVED

Alias:	None

Product:	Mesa
Classification:	Unclassified
Component:	Drivers/DRI/i965
Version:	unspecified
Hardware:	Other All

Importance:	low normal
Assignee:	Intel 3D Bugs Mailing List
QA Contact:	Intel 3D Bugs Mailing List

URL:
Whiteboard:	Triaged ReadyForDev
Keywords:

Depends on:
Blocks:

Reported:	2018-04-23 09:12 UTC by Timo Aaltonen
Modified:	2019-09-25 19:10 UTC
CC List:	2 users

See Also:
i915 platform:	CNL
i915 features:	GPU hang

Attachments
gpu dump (87.05 KB, text/plain) 2018-04-23 09:12 UTC, Timo Aaltonen
dmesg with dinq (2.29 MB, text/plain) 2018-06-01 09:04 UTC, Timo Aaltonen

Description Timo Aaltonen 2018-04-23 09:12:23 UTC

Created attachment 138994
gpu dump

I'm seeing hangs if the GPU is enabled on a CNL-Y SDP (UMCY2SNEP). I've also suspected general HW failure, but after re-seating the CPU and torturing it with stock-ish 4.15 on Ubuntu bionic I'm not able to hang the machine.

Attached is one GPU dump I was able to get with a -nightly kernel from Apr 18. Mesa version is 18.0.0.

Comment 1 Jani Saarinen 2018-04-30 07:13:51 UTC

Have you tried the latest drm-tip (https://cgit.freedesktop.org/drm-tip)? Please also attach a dmesg from boot, captured with the kernel parameters drm.debug=0x1e log_buf_len=4M.
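
For readers unsure what drm.debug=0x1e enables, here is a minimal sketch that decodes the bitmask. It assumes the category bits documented for the DRM core around that kernel generation (core, driver, kms, prime, atomic, vbl); the helper name describe_drm_debug and the table are illustrative and not part of any kernel API, so check drm_print.h in your own tree.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* One drm.debug category bit and its conventional name. */
struct drm_debug_bit {
    unsigned int mask;
    const char *name;
};

/* Assumption: bit layout as documented for the drm.debug module
 * parameter in kernels of that era; newer kernels add more bits. */
static const struct drm_debug_bit drm_debug_bits[] = {
    { 0x01, "core" },
    { 0x02, "driver" },
    { 0x04, "kms" },
    { 0x08, "prime" },
    { 0x10, "atomic" },
    { 0x20, "vbl" },
};

/* Hypothetical helper: write the space-separated names of the
 * categories enabled in `value` into `out`. Returns the number of
 * categories written; stops early if `out` is too small. */
int describe_drm_debug(unsigned int value, char *out, size_t outlen)
{
    size_t nbits = sizeof(drm_debug_bits) / sizeof(drm_debug_bits[0]);
    size_t used = 0;
    int count = 0;

    assert(out != NULL && outlen > 0);
    out[0] = '\0';

    for (size_t i = 0; i < nbits; i++) {
        if (!(value & drm_debug_bits[i].mask))
            continue;
        int n = snprintf(out + used, outlen - used, "%s%s",
                         count ? " " : "", drm_debug_bits[i].name);
        if (n < 0 || (size_t)n >= outlen - used)
            break; /* buffer too small, stop here */
        used += (size_t)n;
        count++;
    }
    return count;
}
```

Under those assumptions, calling describe_drm_debug(0x1e, buf, sizeof buf) reports the driver, kms, prime and atomic categories, that is, everything except the verbose core and vblank messages.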

Comment 2 Timo Aaltonen 2018-05-08 07:44:05 UTC

Yes, the system hang is apparently fixed by:

Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date:   Thu Apr 12 17:58:02 2018 +0300

    drm/i915/cnl: Use mmio access to context status buffer


but I was told the remaining GPU hang was reproduced by your devs, so I'm now waiting for fixes.

Comment 3 Chris Wilson 2018-05-11 15:44:48 UTC

(In reply to Timo Aaltonen from comment #2)
> Yes, the system hang is apparently fixed by:
> 
> Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Date:   Thu Apr 12 17:58:02 2018 +0300
> 
>     drm/i915/cnl: Use mmio access to context status buffer
> 

Now replaced by

commit 77dfedb5be03779f9a5d83e323a1b36e32090105
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri May 11 13:11:45 2018 +0100

    drm/i915/execlists: Use rmb() to order CSB reads
    
    We assume that the CSB is written using the normal ringbuffer
    coherency protocols, as outlined in kernel/events/ring_buffer.c:
    
        *   (HW)                              (DRIVER)
        *
        *   if (LOAD ->data_tail) {            LOAD ->data_head
        *                      (A)             smp_rmb()       (C)
        *      STORE $data                     LOAD $data
        *      smp_wmb()       (B)             smp_mb()        (D)
        *      STORE ->data_head               STORE ->data_tail
        *   }
    
    So we assume that the HW fulfils its ordering requirements (B), and so
    we should use a complementary rmb (C) to ensure that our read of its
    WRITE pointer is completed before we start accessing the data.
    
    The final mb (D) is implied by the uncached mmio we perform to inform
    the HW of our READ pointer.
    
    References: https://bugs.freedesktop.org/show_bug.cgi?id=105064
    References: https://bugs.freedesktop.org/show_bug.cgi?id=105888
    References: https://bugs.freedesktop.org/show_bug.cgi?id=106185
    Fixes: 767a983ab255 ("drm/i915/execlists: Read the context-status HEAD from the HWSP")
    References: 61bf9719fa17 ("drm/i915/cnl: Use mmio access to context status buffer")
    Suggested-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Michał Winiarski <michal.winiarski@intel.com>
    Cc: Rafael Antognolli <rafael.antognolli@intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Cc: Timo Aaltonen <tjaalton@ubuntu.com>
    Tested-by: Timo Aaltonen <tjaalton@ubuntu.com>
    Acked-by: Michel Thierry <michel.thierry@intel.com>
    Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180511121147.31915-1-chris@chris-wilson.co.uk
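
The ordering protocol described in the commit message above can be sketched in portable userspace C11. This is only an analogy under stated assumptions, not the i915 code: the struct csb_ring type, the csb_produce and csb_consume functions, and the ring size are hypothetical names invented for illustration, and C11 acquire/release operations stand in for the kernel's smp_wmb(), rmb() and the uncached mmio write.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define CSB_ENTRIES 8u /* hypothetical size; must be a power of two */

struct csb_ring {
    uint32_t data[CSB_ENTRIES];
    _Atomic uint32_t head; /* WRITE pointer, advanced by the producer ("HW") */
    _Atomic uint32_t tail; /* READ pointer, advanced by the consumer ("DRIVER") */
};

/* Producer: (A) check for space, STORE $data, (B) publish the new head. */
bool csb_produce(struct csb_ring *r, uint32_t value)
{
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);

    if (head - tail == CSB_ENTRIES)
        return false; /* ring is full */

    r->data[head % CSB_ENTRIES] = value;
    /* (B): the release store plays the role of smp_wmb() followed by
     * STORE ->data_head, so the data is visible before the pointer. */
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return true;
}

/* Consumer: LOAD head, (C) read barrier, LOAD $data, (D) STORE tail. */
bool csb_consume(struct csb_ring *r, uint32_t *value)
{
    uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);

    if (head == tail)
        return false; /* nothing new to read */

    /* (C): the acquire fence stands in for rmb(); the data read below
     * cannot be reordered before the load of the WRITE pointer. */
    atomic_thread_fence(memory_order_acquire);
    *value = r->data[tail % CSB_ENTRIES];

    /* (D): the release store keeps the data read ahead of the READ
     * pointer update (the driver gets this from its uncached mmio). */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
    return true;
}
```

Without step (C), a CPU could in principle read the data slot before it has observed the updated WRITE pointer and so process a stale entry, which is the class of problem the commit message describes.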

Comment 4 Jani Saarinen 2018-05-13 07:49:17 UTC

Timo, does this change help any?

Comment 5 Chris Wilson 2018-05-13 08:25:21 UTC

(In reply to Jani Saarinen from comment #4)
> Timo, does this change help any?

The reported GPU hang is actually unrelated to the aforementioned patches. It is unclear whether the cause is a missing workaround (w/a) in the kernel or in Mesa.

Comment 6 Chris Wilson 2018-05-13 08:26:09 UTC

(In reply to Timo Aaltonen from comment #2)
> I was told the remaining GPU hang was reproduced by your devs, so I'm
> now waiting for fixes.

For the record, who? We can ask them for a status update...

Comment 7 Timo Aaltonen 2018-05-30 07:24:48 UTC

Bah, the patch from comment #3 certainly helped, but it seems it's not enough after all. I can still hang the machine, but it takes longer now.

Comment 8 Timo Aaltonen 2018-05-30 13:16:48 UTC

Oh well, now with dinq (and not 4.15+backports) it doesn't hard hang anymore, so my backport is somewhat incomplete even with 77dfedb5b added.

GPU hangs and widget corruption still happen, though.

Comment 9 Simon Lee 2018-05-31 16:48:32 UTC

(In reply to Timo Aaltonen from comment #8)
> oh well, now with dinq (and not 4.15+backports) it doesn't hard hang
> anymore, so my backport is somewhat incomplete even with 77dfedb5b added.
> 
> GPU hangs/widget corruption happen still, though.

Do you have an updated log for the GPU hang/widget corruption?

Comment 10 Timo Aaltonen 2018-06-01 09:04:39 UTC

Created attachment 139900
dmesg with dinq

dmesg dump attached.

It's easy to reproduce: open gnome-control-center and click through the panels in the left-side list until the graphics start to get corrupted. After a while there will be a GPU hang and possibly an X server crash.

And apparently, after I logged back in, saved the logfile to a USB stick and let the machine idle for a while, it hung hard again. So that's not fixed either :/

Comment 11 Chris Wilson 2018-08-31 13:51:11 UTC

To get some more attention from the people in the know...

Comment 12 GitLab Migration User 2019-09-25 19:10:52 UTC

-- GitLab Migration Automatic Message --

This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity.

You can subscribe to and participate further in the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1718.
