Created attachment 138994
gpu dump

I'm seeing hangs if the GPU is enabled on a CNL-Y SDP (UMCY2SNEP). I also suspected a general HW failure, but after re-seating the CPU and torturing it with a stock-ish 4.15 on Ubuntu bionic I'm not able to hang the machine.

Attached is one GPU dump I was able to get with a -nightly kernel from Apr 18. Mesa version is 18.0.0.
Have you tried the latest drm-tip (https://cgit.freedesktop.org/drm-tip)? Please also attach the dmesg from boot, captured with the kernel parameters drm.debug=0x1e log_buf_len=4M.
Yes, the system hang is apparently fixed by:

Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date: Thu Apr 12 17:58:02 2018 +0300

    drm/i915/cnl: Use mmio access to context status buffer

but I was told the remaining GPU hang was reproduced by your devs, so I'm now waiting for fixes.
(In reply to Timo Aaltonen from comment #2)
> Yes, the system hang is apparently fixed by:
>
> Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Date: Thu Apr 12 17:58:02 2018 +0300
>
> drm/i915/cnl: Use mmio access to context status buffer
>

Now replaced by

commit 77dfedb5be03779f9a5d83e323a1b36e32090105
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date: Fri May 11 13:11:45 2018 +0100

    drm/i915/execlists: Use rmb() to order CSB reads

    We assume that the CSB is written using the normal ringbuffer
    coherency protocols, as outlined in kernel/events/ring_buffer.c:

     *   (HW)                              (DRIVER)
     *
     *   if (LOAD ->data_tail) {            LOAD ->data_head
     *                      (A)             smp_rmb()       (C)
     *      STORE $data                     LOAD $data
     *      smp_wmb()       (B)             smp_mb()        (D)
     *      STORE ->data_head               STORE ->data_tail
     *   }

    So we assume that the HW fulfils its ordering requirements (B), and so
    we should use a complimentary rmb (C) to ensure that our read of its
    WRITE pointer is completed before we start accessing the data.

    The final mb (D) is implied by the uncached mmio we perform to inform
    the HW of our READ pointer.
    References: https://bugs.freedesktop.org/show_bug.cgi?id=105064
    References: https://bugs.freedesktop.org/show_bug.cgi?id=105888
    References: https://bugs.freedesktop.org/show_bug.cgi?id=106185
    Fixes: 767a983ab255 ("drm/i915/execlists: Read the context-status HEAD from the HWSP")
    References: 61bf9719fa17 ("drm/i915/cnl: Use mmio access to context status buffer")
    Suggested-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: Michał Winiarski <michal.winiarski@intel.com>
    Cc: Rafael Antognolli <rafael.antognolli@intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Cc: Timo Aaltonen <tjaalton@ubuntu.com>
    Tested-by: Timo Aaltonen <tjaalton@ubuntu.com>
    Acked-by: Michel Thierry <michel.thierry@intel.com>
    Acked-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180511121147.31915-1-chris@chris-wilson.co.uk
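For anyone following along, the ordering protocol described in that commit message can be sketched in userspace C11. This is only an illustration under stated assumptions, not the i915 code: `struct csb_ring`, `RING_ENTRIES`, `ring_produce()` and `ring_consume()` are hypothetical names, the producer's capacity check (A) is left out, and C11 fences stand in for the kernel's smp_wmb()/rmb(). In the driver, (D) is implied by the uncached mmio write; here a release store plays that role.

```c
#include <stdatomic.h>
#include <stdint.h>

#define RING_ENTRIES 8u /* hypothetical size, power of two so wraparound stays consistent */

struct csb_ring {
	uint32_t data[RING_ENTRIES];  /* payload slots, written by the producer */
	_Atomic uint32_t write_idx;   /* producer's WRITE pointer (->data_head) */
	_Atomic uint32_t read_idx;    /* consumer's READ pointer (->data_tail) */
};

/* Producer ("HW" column): STORE $data, barrier (B), STORE ->data_head.
 * The capacity check (A) is omitted in this sketch. */
static inline void ring_produce(struct csb_ring *r, uint32_t value)
{
	uint32_t head = atomic_load_explicit(&r->write_idx, memory_order_relaxed);

	r->data[head % RING_ENTRIES] = value;
	atomic_thread_fence(memory_order_release); /* (B): data visible before head */
	atomic_store_explicit(&r->write_idx, head + 1, memory_order_relaxed);
}

/* Consumer ("DRIVER" column): LOAD ->data_head, barrier (C), LOAD $data,
 * then publish the new READ pointer. Returns the number of entries copied. */
static inline unsigned int ring_consume(struct csb_ring *r, uint32_t *out,
					unsigned int max)
{
	uint32_t tail = atomic_load_explicit(&r->read_idx, memory_order_relaxed);
	uint32_t head = atomic_load_explicit(&r->write_idx, memory_order_relaxed);
	unsigned int n = 0;

	/* (C): the read of the WRITE pointer completes before any data read */
	atomic_thread_fence(memory_order_acquire);

	while (tail != head && n < max)
		out[n++] = r->data[tail++ % RING_ENTRIES];

	/* Stand-in for (D): data loads finish before the tail update is visible */
	atomic_store_explicit(&r->read_idx, tail, memory_order_release);
	return n;
}
```

Without the fence at (C), the consumer may read a data slot before the load of the WRITE pointer has completed and so see a stale entry, which is the failure mode the rmb() in the commit guards against.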
Timo, does this change help any?
(In reply to Jani Saarinen from comment #4)
> Timo, does this change help any?

The reported GPU hang doesn't actually have anything to do with the aforementioned patches. It is uncertain whether it is caused by a missing w/a in the kernel or in mesa.
(In reply to Timo Aaltonen from comment #2)
> I was told the remaining GPU hang was reproduced by your devs, so I'm
> now waiting for fixes.

For the record, who? We can ask them for a status update...
Bah, the patch from #3 certainly helped, but it seems it's not enough after all. I can still hang the machine, it just takes longer now.
Oh well, now with dinq (drm-intel-next-queued, and not 4.15+backports) it doesn't hard hang anymore, so my backport is somewhat incomplete even with 77dfedb5b added.

GPU hangs/widget corruption still happen, though.
(In reply to Timo Aaltonen from comment #8)
> oh well, now with dinq (and not 4.15+backports) it doesn't hard hang
> anymore, so my backport is somewhat incomplete even with 77dfedb5b added.
>
> GPU hangs/widget corruption happen still, though.

Do you have an updated log for the GPU hang/widget corruption?
Created attachment 139900
dmesg with dinq

dmesg dump attached.

It's easy to reproduce: open gnome-control-center and click through the left-side config panel until the graphics start to get corrupted. After a while there will be a GPU hang and possibly an xserver crash.

And apparently, after logging back in, saving the logfile to a USB stick and letting it idle for a while, the machine hung hard again, so that's not fixed either :/
To get some more attention from the people in the know...
-- GitLab Migration Automatic Message -- This bug has been migrated to freedesktop.org's GitLab instance and has been closed from further activity. You can subscribe and participate further through the new bug through this link to our GitLab instance: https://gitlab.freedesktop.org/mesa/mesa/issues/1718.