Created attachment 130189 [details]
output from /sys/class/drm/card0/error

GPU hangs during 'normal' computer usage. Cannot reproduce; happens approximately every hour or two.

Desktop System: Linux 4.11.0-rc1-torvalds #6 SMP x86_64 x86_64 GNU/Linux
Primary applications running: Firefox, Terminator, Emacs, Rhythmbox

Edited dmesg output:

[drm] GPU HANG: ecode 7:0:0xf3cffffe, in compiz [1167], reason: Hang on render ring, action: reset
[drm] GPU hangs can indicate a bug anywhere in the entire gfx stack, including userspace.
[drm] Please file a _new_ bug report on bugs.freedesktop.org against DRI -> DRM/Intel
[drm] drm/i915 developers can then reassign to the right component if it's not a kernel issue.
[drm] The gpu crash dump is required to analyze gpu hangs, so please always attach it.
[drm] GPU crash dump saved to /sys/class/drm/card0/error
drm/i915: Resetting chip after gpu hang

Bug present while running all mainline kernels since about 4.10-rc8. Also present when running Ubuntu kernel 4.4.0-65-generic.
Created attachment 130190 [details] Complete dmesg output since boot
Bug is reproducible by building a kernel with threading set to utilize all CPUs, i.e. with all CPUs running at 100% load.
Kernel build does not trigger bug 100% of the time.
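Since the kernel build only triggers the hang some of the time, it may help to script the attempt and check the log afterwards. The following is a minimal sketch, not a tested reproducer: `find_gpu_hangs`, `build_command` and `reproduce_once` are hypothetical helper names, the regular expression is written only against the single "GPU HANG" line quoted in this report, and the kernel tree path is something you supply yourself.

```python
import os
import re
import subprocess

# Pattern written against the one "GPU HANG" line quoted in this report.
# Assumption: other kernels or process names with spaces may not match.
HANG_RE = re.compile(
    r"GPU HANG: ecode (?P<ecode>[0-9a-fx:]+), in (?P<proc>\S+) \[(?P<pid>\d+)\], "
    r"reason: (?P<reason>.+?), action: (?P<action>\w+)"
)


def find_gpu_hangs(dmesg_text):
    """Return one dict of fields per 'GPU HANG' line found in dmesg_text."""
    return [m.groupdict() for m in HANG_RE.finditer(dmesg_text)]


def build_command(jobs=None):
    """Return a 'make -jN' command; N defaults to the number of CPUs."""
    n = jobs or os.cpu_count() or 1
    return ["make", "-j{}".format(n)]


def reproduce_once(kernel_tree):
    """Run one full-load kernel build, then return any hangs seen in dmesg.

    kernel_tree is a path to a configured kernel source tree (supplied by
    the caller). Reading dmesg may require root on some systems.
    """
    subprocess.run(build_command(), cwd=kernel_tree, check=False)
    log = subprocess.run(["dmesg"], capture_output=True, text=True).stdout
    return find_gpu_hangs(log)
```

Calling `reproduce_once()` in a loop until it returns a non-empty list would give a rough idea of how often the build triggers the hang on a given kernel.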
*** Bug 100214 has been marked as a duplicate of this bug. ***
*** Bug 100315 has been marked as a duplicate of this bug. ***
*** Bug 100347 has been marked as a duplicate of this bug. ***
commit 5d4bac5503fcc67dd7999571e243cee49371aef7
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed Mar 22 20:59:30 2017 +0000

    drm/i915: Restore marking context objects as dirty on pinning

    Commit e8a9c58fcd9a ("drm/i915: Unify active context tracking between
    legacy/execlists/guc") converted the legacy intel_ringbuffer submission
    to the same context pinning mechanism as execlists - that is to pin the
    context until the subsequent request is retired. Previously it used the
    vma retirement of the context object to keep itself pinned until the
    next request (after i915_vma_move_to_active()).

    In the conversion, I missed that the vma retirement was also responsible
    for marking the object as dirty. Mark the context object as dirty when
    pinning (equivalent to execlists) which ensures that if the context is
    swapped out due to mempressure or suspend/hibernation, when it is loaded
    back in it does so with the previous state (and not all zero).

    Fixes: e8a9c58fcd9a ("drm/i915: Unify active context tracking between legacy/execlists/guc")
    Reported-by: Dennis Gilmore <dennis@ausil.us>
    Reported-by: Mathieu Marquer <mathieu.marquer@gmail.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=99993
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100181
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Cc: <drm-intel-fixes@lists.freedesktop.org> # v4.11-rc1
    Link: http://patchwork.freedesktop.org/patch/msgid/20170322205930.12762-1-chris@chris-wilson.co.uk
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Created attachment 130478 [details]
cat /sys/class/drm/card0/error

This is not fixed in -rc4, which includes said patch.
I append /sys/class/drm/card0/error from my gpu hang under load (compiling stuff).

Again. 4.10 is clearly *good*. This is introduced in 4.11-rc1.

thanks
martin
-rc4 is *better* than rc3, so this patch clearly improved the situation. -rc4 is still worse than 4.10 as I experienced a gpu hang again.
Created attachment 130479 [details] just the next example of /sys/class/drm/card0/error
-rc4 doesn't include that patch.
*** Bug 100521 has been marked as a duplicate of this bug. ***