Summary: | [skl][bisected] GPU Hangs on drm-intel-nightly | |
---|---|---|---
Product: | DRI | Reporter: | Mike Lothian <mike>
Component: | DRM/Intel | Assignee: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: | normal | |
Priority: | medium | CC: | artjom.simon, intel-gfx-bugs, mike
Version: | XOrg git | Keywords: | bisected
Hardware: | Other | |
OS: | All | |
Whiteboard: | | |
i915 platform: | SKL | i915 features: |
Created attachment 127337 [details]
Error
Please feel free to ignore this, forgot I was testing GuC, doesn't seem to happen when this is disabled - though it's still a regression compared to the other kernels I've been using

Spoke too soon, it just seems to take a bit longer with GuC disabled and X didn't crash

Created attachment 127338 [details]
Dmesg with GuC disabled
Created attachment 127339 [details]
Error
DMAR telltale right before the hang. "Unknown error" fantastic. Try intel_iommu=igfx_off

I tried that for a few hours and didn't get any lock-ups

dup of 89360?

Have you been able to reproduce this?

I've bisected this as well as I can and got:

axion intel # git bisect good
d209b9c3cd281e4543e1150d173388b6d8f29a42 is the first bad commit
commit d209b9c3cd281e4543e1150d173388b6d8f29a42
Author: Michał Winiarski <michal.winiarski@intel.com>
Date: Thu Oct 13 14:02:41 2016 +0200

    drm/i915/gtt: Split gen8_ppgtt_clear_pte_range

    Let's use more top-down approach, where each gen8_ppgtt_clear_* function
    is responsible for clearing the struct passed as an argument and calling
    relevant clear_range functions on lower-level tables.
    Doing this rather than operating on PTE ranges makes the implementation
    of shrinking page tables quite simple.

    v2: Drop min when calculating num_entries, no negation in 48b ppgtt
    check, no newlines in vars block (Joonas)

    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: http://patchwork.freedesktop.org/patch/msgid/1476360162-24062-2-git-send-email-michal.winiarski@intel.com

:040000 040000 3d4b545bc817fdcb2b262dd33d3f95cc2f7612b2 4ef823c1499311cc8d4374ca20133e24ebbd323a M drivers

As this bug doesn't always manifest itself I'm not 100% sure this is the right one

axion intel # git bisect log
git bisect start 'drivers/gpu/drm/i915/'
# bad: [0fb1abf5eac2230894a7352e36022066f88a9b19] drm-intel-nightly: 2016y-10m-27d-14h-09m-00s UTC integration manifest
git bisect bad 0fb1abf5eac2230894a7352e36022066f88a9b19
# good: [07d9a380680d1c0eb51ef87ff2eab5c994949e69] Linux 4.9-rc2
git bisect good 07d9a380680d1c0eb51ef87ff2eab5c994949e69
# bad: [a62163e97bafbc072093ee5645873227f33a43ee] drm/i915/gen9: Make skl_wm_level per-plane
git bisect bad a62163e97bafbc072093ee5645873227f33a43ee
# good: [cdb324bde5700725f04172bbeb6ef0bbbb6886c3] drm/i915: Show bounds of active request in the ring on GPU hang
git bisect good cdb324bde5700725f04172bbeb6ef0bbbb6886c3
# good: [45353ce59b3ec606e0a35386ac04210b1656e829] drm/i915: Treat a framebuffer reference as an active reference whilst shrinking
git bisect good 45353ce59b3ec606e0a35386ac04210b1656e829
# good: [fd6b8f43c9e9a3adc384423a1d3dfeefd38655ea] drm/i915: Make IS_IVYBRIDGE only take dev_priv
git bisect good fd6b8f43c9e9a3adc384423a1d3dfeefd38655ea
# good: [11a914c28679f19d7daf4218c698ac6c3e184e1a] drm/i915: Make IS_VALLEYVIEW only take dev_priv
git bisect good 11a914c28679f19d7daf4218c698ac6c3e184e1a
# bad: [d209b9c3cd281e4543e1150d173388b6d8f29a42] drm/i915/gtt: Split gen8_ppgtt_clear_pte_range
git bisect bad d209b9c3cd281e4543e1150d173388b6d8f29a42
# good: [5db9401983ac7bf9ddc45de54c53ccfa31d21774] drm/i915: Make IS_GEN macros only take dev_priv
git bisect good 5db9401983ac7bf9ddc45de54c53ccfa31d21774
# good: [4fb84d991ef2172d425234391d7215978345f6cd] drm/i915: Remove unused "valid" parameter from pte_encode
git bisect good 4fb84d991ef2172d425234391d7215978345f6cd
# first bad commit: [d209b9c3cd281e4543e1150d173388b6d8f29a42] drm/i915/gtt: Split gen8_ppgtt_clear_pte_range

I originally had d209b9c3cd281e4543e1150d173388b6d8f29a42 down as good but it froze up when it was compiling the next kernel

The commit doesn't revert cleanly, so I've not tested without this commit

Created attachment 127580 [details] [review]
dirty after cleanup

I've applied that patch and I'm now running that kernel, I'm not physically at the machine just now but I'll check for errors and give it a proper test tonight when I'm home

It still happens with this patch applied

Created attachment 127642 [details] [review]
drm/i915/gtt: Fix pte clear range

There was a clear mistake in the commit you bisected into. I attached a fix.
But the failure path with respect to this bug is still unclear.

I have a warning now:

[ 1.186964] WARNING: CPU: 5 PID: 69 at drivers/gpu/drm/i915/intel_dp.c:4023 intel_dp_check_link_status+0x1a3/0x1d0
[ 1.186964] WARN_ON_ONCE(!intel_dp->lane_count)
[ 1.186964] Modules linked in:
[ 1.186964] CPU: 5 PID: 69 Comm: kworker/u16:1 Tainted: G U 4.9.0-rc2-intel+ #124
[ 1.186964] Hardware name: Alienware Alienware 15 R2/Alienware 15 R2, BIOS 1.3.6 08/05/2016
[ 1.186964] Workqueue: events_unbound async_run_entry_fn
[ 1.186964] 0000000000000000 ffffffff813548fc ffffc900035cfc88 0000000000000000
[ 1.186964] ffffffff8108f989 ffff8808a12840f0 ffffc900035cfcd8 ffff88089f4b0000
[ 1.186964] ffff88089f4b0258 ffff8808a12840f0 ffff8808a1284000 ffffffff8108f9fa
[ 1.186964] Call Trace:
[ 1.186964] [<ffffffff813548fc>] ? dump_stack+0x46/0x5a
[ 1.186964] [<ffffffff8108f989>] ? __warn+0xb9/0xe0
[ 1.186964] [<ffffffff8108f9fa>] ? warn_slowpath_fmt+0x4a/0x50
[ 1.186964] [<ffffffff8145367b>] ? drm_dp_dpcd_read+0x4b/0x60
[ 1.186964] [<ffffffff815aefd3>] ? intel_dp_check_link_status+0x1a3/0x1d0
[ 1.186964] [<ffffffff815b4997>] ? intel_dp_detect+0x5f7/0x9e0
[ 1.186964] [<ffffffff814541c0>] ? drm_helper_probe_single_connector_modes+0x400/0x4d0
[ 1.186964] [<ffffffff810bb572>] ? sched_clock_local+0x12/0x80
[ 1.186964] [<ffffffff81461897>] ? drm_fb_helper_initial_config+0x77/0x420
[ 1.186964] [<ffffffff810b2a18>] ? finish_task_switch+0x78/0x1d0
[ 1.186964] [<ffffffff815a686f>] ? intel_fbdev_initial_config+0xf/0x20
[ 1.186964] [<ffffffff810af4ed>] ? async_run_entry_fn+0x2d/0xd0
[ 1.186964] [<ffffffff810a769e>] ? process_one_work+0x1ee/0x490
[ 1.186964] [<ffffffff810a7982>] ? worker_thread+0x42/0x4c0
[ 1.186964] [<ffffffff810a7940>] ? process_one_work+0x490/0x490
[ 1.186964] [<ffffffff810acc19>] ? kthread+0xb9/0xd0
[ 1.186964] [<ffffffff810acb60>] ? kthread_park+0x50/0x50
[ 1.186964] [<ffffffff81087312>] ? ret_from_fork+0x22/0x30
[ 1.186964] ---[ end trace aa2a2fb24b6f8b92 ]---

This is very early compared to where the freezes used to be, and now I don't get any freezes

Reference to Mika's patch set: https://patchwork.freedesktop.org/series/14620/

Would you like me to test those two patches?

I'm getting a similar WARN_ON_ONCE with current nightly with these patches seemingly already applied (GNU patch complains "Reversed (or previously applied) patch detected!" on both). What can we do to debug this further?

[ 2.140178] ------------[ cut here ]------------
[ 2.140261] WARNING: CPU: 3 PID: 46 at drivers/gpu/drm/i915/intel_dp.c:4022 intel_dp_check_link_status+0x1d7/0x200 [i915]
[ 2.140263] WARN_ON_ONCE(!intel_dp->lane_count)
[ 2.140272] Modules linked in: i915 video button intel_gtt i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm
[ 2.140276] CPU: 3 PID: 46 Comm: kworker/3:1 Not tainted 4.9.0-1-drm-intel-nightly #1
[ 2.140278] Hardware name: Dell Inc. XPS 13 9350/07TYC2, BIOS 1.4.4 06/14/2016
[ 2.140286] Workqueue: events output_poll_execute [drm_kms_helper]
[ 2.140292] ffffc90000e13bc8 ffffffff812f8de0 ffffc90000e13c18 0000000000000000
[ 2.140295] ffffc90000e13c08 ffffffff8107cddb 00000fb600e13c40 ffff8802760270f0
[ 2.140299] 0000000000000001 ffff880273c58000 ffff880273c58258 ffff880276027000
[ 2.140300] Call Trace:
[ 2.140307] [<ffffffff812f8de0>] dump_stack+0x63/0x83
[ 2.140312] [<ffffffff8107cddb>] __warn+0xcb/0xf0
[ 2.140315] [<ffffffff8107ce5f>] warn_slowpath_fmt+0x5f/0x80
[ 2.140322] [<ffffffffa007d467>] ? drm_dp_dpcd_read+0x57/0x70 [drm_kms_helper]
[ 2.140386] [<ffffffffa0172e67>] intel_dp_check_link_status+0x1d7/0x200 [i915]
[ 2.140445] [<ffffffffa01790b7>] intel_dp_detect+0x697/0xa40 [i915]
[ 2.140452] [<ffffffffa007e15f>] drm_helper_probe_single_connector_modes+0x3ff/0x4f0 [drm_kms_helper]
[ 2.140459] [<ffffffffa008d26b>] drm_fb_helper_hotplug_event+0x10b/0x150 [drm_kms_helper]
[ 2.140516] [<ffffffffa0169a14>] intel_fbdev_output_poll_changed+0x24/0x30 [i915]
[ 2.140522] [<ffffffffa007da27>] drm_kms_helper_hotplug_event+0x27/0x30 [drm_kms_helper]
[ 2.140528] [<ffffffffa007dc28>] output_poll_execute+0x198/0x1e0 [drm_kms_helper]
[ 2.140533] [<ffffffff81096b35>] process_one_work+0x1e5/0x470
[ 2.140537] [<ffffffff81096e08>] worker_thread+0x48/0x4e0
[ 2.140541] [<ffffffff81096dc0>] ? process_one_work+0x470/0x470
[ 2.140544] [<ffffffff81096dc0>] ? process_one_work+0x470/0x470
[ 2.140547] [<ffffffff8109c999>] kthread+0xd9/0xf0
[ 2.140551] [<ffffffff8102d752>] ? __switch_to+0x2d2/0x630
[ 2.140553] [<ffffffff8109c8c0>] ? kthread_park+0x60/0x60
[ 2.140559] [<ffffffff815fbdd5>] ret_from_fork+0x25/0x30
[ 2.140562] ---[ end trace 4fe5657f076aec76 ]---

Created attachment 127706 [details]
full dmesg output on DELL XPS 13 after boot
Fwiw, the warning is unrelated to the hangs, bug 98374. As the freezes are gone (hopefully still) and the patches are upstream, onwards.

commit 37c6393431bf526d6f465e095c1201c1b890dd51
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date: Tue Nov 1 15:27:36 2016 +0200

    drm/i915/gtt: Fix pte clear range

(In reply to Chris Wilson from comment #21)
> Fwiw, the warning is unrelated to the hangs, bug 98374. As the freezes are
> gone (hopefully still) and the patches are upstream, onwards.
>
> commit 37c6393431bf526d6f465e095c1201c1b890dd51
> Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Date: Tue Nov 1 15:27:36 2016 +0200
>
> drm/i915/gtt: Fix pte clear range

Closing as fixed (upstream + no hang)
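
For anyone repeating the path-limited bisect from this thread, here is a minimal sketch of how a saved `git bisect log` could be inspected. The helper names `first_bad_commit` and `count_verdicts` are hypothetical and not part of git; the file name `bisect.log` is likewise an assumption. The commands in the comments (`git bisect start`, `git bisect log`, `git bisect replay`) are standard git, but the placeholders in angle brackets have to be filled in by the reader.

```shell
#!/bin/sh
# Sketch only: helper names below are illustrative, not git commands.
#
# A path-limited bisect similar to the one in the log above would be run
# inside a kernel tree roughly like this (not executed here):
#   git bisect start -- drivers/gpu/drm/i915/
#   git bisect bad  <bad-commit>
#   git bisect good <good-commit>
#   git bisect log > bisect.log      # save progress to a file
#   git bisect replay bisect.log     # restore that progress later

# Print the hash recorded on the "# first bad commit:" line of a saved log.
# $1 = path to the saved bisect log
first_bad_commit() {
    sed -n 's/^# first bad commit: \[\([0-9a-f]\{40\}\)].*/\1/p' "$1"
}

# Count how many commits were explicitly marked with a given verdict.
# $1 = path to the saved bisect log, $2 = "good" or "bad"
count_verdicts() {
    grep -c "^git bisect $2 " "$1"
}
```

Because the hang in this report did not reproduce on every boot, a low count of "bad" verdicts relative to "good" ones is a hint that some "good" marks may deserve a longer test run before the result is trusted.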
Created attachment 127336 [details]
Dmesg

Tried out nightly for the new skl watermark patches.

Plasma 5 freezes up and eventually I'm taken back to the login screen.

Attaching the dmesg and the error file.