Created attachment 135791
Dmesg output with trace.

After a couple of GPU hangs, I get the attached trace output. Sometimes the machine freezes completely, although I couldn't reproduce that with the serial cable attached.

drm-tip from today has the issue:

5144438448829ec2a3d94fd16a9e69a52cfa7b3b
Author: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
Date:   Tue Nov 28 17:05:46 2017 +0000

    drm-tip: 2017y-11m-28d-17h-04m-56s UTC integration manifest

And this wouldn't happen on the one from last week:

42670ce69ef7a03b53bea25fad60a9b3931402cf
Author: Lionel Landwerlin <lionel.g.landwerlin@intel.com>
Date:   Mon Nov 20 16:11:13 2017 +0000

    drm-tip: 2017y-11m-20d-16h-09m-47s UTC integration manifest
The dmesg is incomplete; parts were lost because systemd couldn't keep up. Off the top of my head, we've made no changes around ppgtt in the last week. I'm guessing a bisect will head off into -rc1. Please do bisect this if possible, and try to capture the full dmesg; disabling ftrace-dump-on-oops may help.
I would also recommend enabling kasan.
Created attachment 135809
Dmesg output.

For some reason I couldn't reproduce it with kasan enabled. This last log was with drm-tip from today:

commit 8f873adc152c899d0e1b9ccaf9caa468955c1fdd (HEAD -> wip/cnl, drm-tip/drm-tip, tip/2017-11-29)
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Wed Nov 29 17:29:54 2017 +0100

    drm-tip: 2017y-11m-29d-16h-28m-40s UTC integration manifest
Wowser; that's pretty random explosions. If not kasan, perhaps some of the other memdebug options will catch something. I hope that system is expendable. Bisection is becoming increasingly invaluable; something is severely amiss, and I'm 95% certain it's not us!
I'm sorry, I can't afford to bisect this :(

If anyone is interested, it should be pretty trivial to reproduce with upstream mesa and something like:

$ PIGLIT_PLATFORM=gbm ./piglit run -t spec@ext_framebuffer_multisample@accuracy gpu piglit_results

No Xorg required. I'm rolling back to 4.14 for now.
My bisect with "./gem_concurrent_all --run-subtest 4KiB-tiny-partial-gtt-early-read-forked-hang-all" pointed to this:

commit 4e90a6e222720dd0ec529f87eca990c736ba8ede
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sun Nov 26 22:09:01 2017 +0000

    drm/i915: Record default HW state in the GPU error state

With it reverted on drm-tip, I could get the test passing. Chris, any quick idea? Maybe a concurrency issue here?

Rafael will still check whether the same patch is really what he saw...
Try

diff --git a/drivers/gpu/drm/i915/i915_gpu_error.c b/drivers/gpu/drm/i915/i915_gpu_error.c
index e07b5247cd96..c69bd0a8c48c 100644
--- a/drivers/gpu/drm/i915/i915_gpu_error.c
+++ b/drivers/gpu/drm/i915/i915_gpu_error.c
@@ -1413,6 +1413,7 @@ capture_object(struct drm_i915_private *dev_priv,
 	if (obj && i915_gem_object_has_pages(obj)) {
 		struct i915_vma fake = {
 			.node = { .start = U64_MAX, .size = obj->base.size },
+			.size = obj->base.size;
 			.pages = obj->mm.pages,
 			.obj = obj,
 		};
(And that 5%!)
With .size defined (s/;/,/) on top of drm-tip, ./gem_concurrent_all runs fine for me...
piglit also seems happy with that patch. Thanks for the quick fix.
commit b5e0a9418e09a7b6df1728a26832c7c34aa1adf8
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Dec 1 00:15:36 2017 +0000

    drm/i915: Set fake_vma.size as well as fake_vma.node.size for capture

    When capturing the bo, we allocate an error object with an array of
    min(vma->size, vma->node.size) pages, plus a bit for compression
    overhead. However, when creating the fake vma to describe the bo, only
    one of the sizes was filled in, resulting in a too small array.

    Through my and CI testing, this was sufficient for the mostly empty
    NULL context as it compressed well (or the out-of-bounds access simply
    didn't cause an issue). However, in real workloads on Cannonlake, we
    were overflowing that array and causing havoc with the random memory
    corruption.

    Reported-by: Rafael Antognolli <rafael.antognolli@intel.com>
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=103964
    Fixes: 4e90a6e22272 ("drm/i915: Record default HW state in the GPU error state")
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Tested-by: Rodrigo Vivi <rodrigo.vivi@gmail.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20171201001536.13941-1-chris@chris-wilson.co.uk
    Reviewed-by: Mika Kuoppala <mika.kuoppala@linux.intel.com>
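
For anyone reading this later, the sizing problem described in the commit message can be illustrated with a small userspace sketch. This is not the i915 code: `sketch_vma`, `sketch_node`, `SKETCH_PAGE_SIZE` and the helper functions are hypothetical stand-ins, and the "bit for compression overhead" is left out. It only shows that a designated initializer which omits `.size` leaves that member zero, so a page count derived from min(size, node.size) comes out as zero.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures; only the two
 * size fields that matter for the sizing bug are modelled. */
struct sketch_node {
	uint64_t start;
	uint64_t size;
};

struct sketch_vma {
	struct sketch_node node;
	uint64_t size;
};

/* Assumed page size for the illustration. */
#define SKETCH_PAGE_SIZE 4096ULL

static uint64_t min_u64(uint64_t a, uint64_t b)
{
	return a < b ? a : b;
}

/* Page slots sized from min(vma->size, vma->node.size), rounded up.
 * The real capture path adds compression overhead; omitted here. */
static uint64_t sketch_capture_pages(const struct sketch_vma *vma)
{
	uint64_t bytes = min_u64(vma->size, vma->node.size);

	return (bytes + SKETCH_PAGE_SIZE - 1) / SKETCH_PAGE_SIZE;
}

/* Before the fix: only .node.size is set, so .size is implicitly 0. */
static uint64_t sketch_pages_before_fix(uint64_t obj_size)
{
	struct sketch_vma fake = {
		.node = { .start = UINT64_MAX, .size = obj_size },
	};

	assert(fake.size == 0); /* omitted members are zero-initialized */
	return sketch_capture_pages(&fake);
}

/* After the fix: both sizes describe the object. */
static uint64_t sketch_pages_after_fix(uint64_t obj_size)
{
	struct sketch_vma fake = {
		.node = { .start = UINT64_MAX, .size = obj_size },
		.size = obj_size,
	};

	return sketch_capture_pages(&fake);
}
```

Calling `sketch_pages_before_fix(16 * SKETCH_PAGE_SIZE)` from a `main()` yields 0 page slots, while `sketch_pages_after_fix()` with the same argument yields 16, which matches the "too small array" the commit message describes.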