Bug 98282

Summary:

[skl][bisected] GPU Hangs on drm-intel-nightly

Product:

DRI

Reporter:

Mike Lothian <mike>

Component:

DRM/Intel

Assignee:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Status:

CLOSED FIXED

QA Contact:

Intel GFX Bugs mailing list <intel-gfx-bugs>

Severity:

normal

Priority:

medium

CC:

artjom.simon, intel-gfx-bugs, mike

Version:

XOrg git

Keywords:

bisected

Hardware:

Other

OS:

All

Whiteboard:

i915 platform:

SKL

i915 features:

Attachments:

Description	Flags
Dmesg	none
Error	none
Dmesg with GuC disabled	none
Error	none
dirty after cleanup	none
drm/i915/gtt: Fix pte clear range	none
full dmesg output on DELL XPS 13 after boot	none

Description Mike Lothian 2016-10-17 00:22:58 UTC

Created attachment 127336 [details]
Dmesg

Tried out nightly for the new skl watermark patches

Plasma 5 freezes up and eventually I'm taken back to the login screen

Attaching the dmesg and the error file

Comment 1 Mike Lothian 2016-10-17 00:23:19 UTC

Created attachment 127337 [details]
Error

Comment 2 Mike Lothian 2016-10-17 00:30:58 UTC

Please feel free to ignore this; I forgot I was testing GuC. It doesn't seem to happen when GuC is disabled, though it's still a regression compared to the other kernels I've been using

Comment 3 Mike Lothian 2016-10-17 00:33:49 UTC

Spoke too soon, it just seems to take a bit longer with GuC disabled and X didn't crash

Comment 4 Mike Lothian 2016-10-17 00:34:19 UTC

Created attachment 127338 [details]
Dmesg with GuC disabled

Comment 5 Mike Lothian 2016-10-17 00:35:44 UTC

Created attachment 127339 [details]
Error

Comment 6 Chris Wilson 2016-10-17 07:42:28 UTC

DMAR telltale right before the hang. "Unknown error" fantastic. Try intel_iommu=igfx_off
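
For anyone following along: intel_iommu=igfx_off goes on the kernel command line and asks the kernel not to enable DMA remapping for the integrated graphics device. A minimal sketch for confirming that the option actually reached the running kernel is below. The function names are hypothetical, and treating the value as a comma-separated list (e.g. intel_iommu=on,igfx_off) is an assumption about how the option is parsed.

```python
def intel_iommu_options(cmdline: str) -> list[str]:
    """Return the options passed via intel_iommu=... (empty list if absent)."""
    options = []
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if key == "intel_iommu" and sep:
            # Assumption: the value may carry several comma-separated options.
            options.extend(opt for opt in value.split(",") if opt)
    return options


def igfx_iommu_disabled(cmdline: str) -> bool:
    """True if the command line requests igfx_off for the Intel IOMMU."""
    return "igfx_off" in intel_iommu_options(cmdline)
```

On a running system the input would be the contents of /proc/cmdline, for example igfx_iommu_disabled(open("/proc/cmdline").read()).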

Comment 7 Mike Lothian 2016-10-17 13:49:28 UTC

I tried that for a few hours and didn't get any lockups

Comment 8 yann 2016-10-19 09:02:34 UTC

dup of 89360?

Comment 9 Mike Lothian 2016-10-27 13:43:32 UTC

Have you been able to reproduce this?

Comment 10 Mike Lothian 2016-10-27 20:54:03 UTC

I've bisected this as well as I can and got:

axion intel # git bisect good
d209b9c3cd281e4543e1150d173388b6d8f29a42 is the first bad commit
commit d209b9c3cd281e4543e1150d173388b6d8f29a42
Author: Michał Winiarski <michal.winiarski@intel.com>
Date:   Thu Oct 13 14:02:41 2016 +0200

    drm/i915/gtt: Split gen8_ppgtt_clear_pte_range
    
    Let's use more top-down approach, where each gen8_ppgtt_clear_* function
    is responsible for clearing the struct passed as an argument and calling
    relevant clear_range functions on lower-level tables.
    Doing this rather than operating on PTE ranges makes the implementation
    of shrinking page tables quite simple.
    
    v2: Drop min when calculating num_entries, no negation in 48b ppgtt
    check, no newlines in vars block (Joonas)
    
    Cc: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Michel Thierry <michel.thierry@intel.com>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Signed-off-by: Michał Winiarski <michal.winiarski@intel.com>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: http://patchwork.freedesktop.org/patch/msgid/1476360162-24062-2-git-send-email-michal.winiarski@intel.com

:040000 040000 3d4b545bc817fdcb2b262dd33d3f95cc2f7612b2 4ef823c1499311cc8d4374ca20133e24ebbd323a M      drivers


As this bug doesn't always manifest itself I'm not 100% sure this is the right one

axion intel # git bisect log
git bisect start 'drivers/gpu/drm/i915/'
# bad: [0fb1abf5eac2230894a7352e36022066f88a9b19] drm-intel-nightly: 2016y-10m-27d-14h-09m-00s UTC integration manifest
git bisect bad 0fb1abf5eac2230894a7352e36022066f88a9b19
# good: [07d9a380680d1c0eb51ef87ff2eab5c994949e69] Linux 4.9-rc2
git bisect good 07d9a380680d1c0eb51ef87ff2eab5c994949e69
# bad: [a62163e97bafbc072093ee5645873227f33a43ee] drm/i915/gen9: Make skl_wm_level per-plane
git bisect bad a62163e97bafbc072093ee5645873227f33a43ee
# good: [cdb324bde5700725f04172bbeb6ef0bbbb6886c3] drm/i915: Show bounds of active request in the ring on GPU hang
git bisect good cdb324bde5700725f04172bbeb6ef0bbbb6886c3
# good: [45353ce59b3ec606e0a35386ac04210b1656e829] drm/i915: Treat a framebuffer reference as an active reference whilst shrinking
git bisect good 45353ce59b3ec606e0a35386ac04210b1656e829
# good: [fd6b8f43c9e9a3adc384423a1d3dfeefd38655ea] drm/i915: Make IS_IVYBRIDGE only take dev_priv
git bisect good fd6b8f43c9e9a3adc384423a1d3dfeefd38655ea
# good: [11a914c28679f19d7daf4218c698ac6c3e184e1a] drm/i915: Make IS_VALLEYVIEW only take dev_priv
git bisect good 11a914c28679f19d7daf4218c698ac6c3e184e1a
# bad: [d209b9c3cd281e4543e1150d173388b6d8f29a42] drm/i915/gtt: Split gen8_ppgtt_clear_pte_range
git bisect bad d209b9c3cd281e4543e1150d173388b6d8f29a42
# good: [5db9401983ac7bf9ddc45de54c53ccfa31d21774] drm/i915: Make IS_GEN macros only take dev_priv
git bisect good 5db9401983ac7bf9ddc45de54c53ccfa31d21774
# good: [4fb84d991ef2172d425234391d7215978345f6cd] drm/i915: Remove unused "valid" parameter from pte_encode
git bisect good 4fb84d991ef2172d425234391d7215978345f6cd
# first bad commit: [d209b9c3cd281e4543e1150d173388b6d8f29a42] drm/i915/gtt: Split gen8_ppgtt_clear_pte_range

I originally had d209b9c3cd281e4543e1150d173388b6d8f29a42 down as good but it froze up when it was compiling the next kernel

The commit doesn't revert cleanly, so I've not tested without this commit

Comment 11 Mika Kuoppala 2016-10-28 08:12:37 UTC

Created attachment 127580 [details] [review]
dirty after cleanup

Comment 12 Mike Lothian 2016-10-28 09:03:50 UTC

I've applied that patch and I'm now running that kernel. I'm not physically at the machine just now, but I'll check for errors and give it a proper test tonight when I'm home

Comment 13 Mike Lothian 2016-10-28 18:13:13 UTC

It still happens with this patch applied

Comment 14 Mika Kuoppala 2016-10-31 16:04:17 UTC

Created attachment 127642 [details] [review]
drm/i915/gtt: Fix pte clear range

Comment 15 Mika Kuoppala 2016-10-31 16:05:58 UTC

There was a clear mistake in the commit you bisected into. I attached a fix.
But the failure path wrt this bug is still unclear.

Comment 16 Mike Lothian 2016-10-31 22:31:42 UTC

I have a warning now:

[    1.186964] WARNING: CPU: 5 PID: 69 at drivers/gpu/drm/i915/intel_dp.c:4023 intel_dp_check_link_status+0x1a3/0x1d0
[    1.186964] WARN_ON_ONCE(!intel_dp->lane_count)
[    1.186964] Modules linked in:
[    1.186964] CPU: 5 PID: 69 Comm: kworker/u16:1 Tainted: G     U          4.9.0-rc2-intel+ #124
[    1.186964] Hardware name: Alienware Alienware 15 R2/Alienware 15 R2, BIOS 1.3.6 08/05/2016
[    1.186964] Workqueue: events_unbound async_run_entry_fn
[    1.186964]  0000000000000000 ffffffff813548fc ffffc900035cfc88 0000000000000000
[    1.186964]  ffffffff8108f989 ffff8808a12840f0 ffffc900035cfcd8 ffff88089f4b0000
[    1.186964]  ffff88089f4b0258 ffff8808a12840f0 ffff8808a1284000 ffffffff8108f9fa
[    1.186964] Call Trace:
[    1.186964]  [<ffffffff813548fc>] ? dump_stack+0x46/0x5a
[    1.186964]  [<ffffffff8108f989>] ? __warn+0xb9/0xe0
[    1.186964]  [<ffffffff8108f9fa>] ? warn_slowpath_fmt+0x4a/0x50
[    1.186964]  [<ffffffff8145367b>] ? drm_dp_dpcd_read+0x4b/0x60
[    1.186964]  [<ffffffff815aefd3>] ? intel_dp_check_link_status+0x1a3/0x1d0
[    1.186964]  [<ffffffff815b4997>] ? intel_dp_detect+0x5f7/0x9e0
[    1.186964]  [<ffffffff814541c0>] ? drm_helper_probe_single_connector_modes+0x400/0x4d0
[    1.186964]  [<ffffffff810bb572>] ? sched_clock_local+0x12/0x80
[    1.186964]  [<ffffffff81461897>] ? drm_fb_helper_initial_config+0x77/0x420
[    1.186964]  [<ffffffff810b2a18>] ? finish_task_switch+0x78/0x1d0
[    1.186964]  [<ffffffff815a686f>] ? intel_fbdev_initial_config+0xf/0x20
[    1.186964]  [<ffffffff810af4ed>] ? async_run_entry_fn+0x2d/0xd0
[    1.186964]  [<ffffffff810a769e>] ? process_one_work+0x1ee/0x490
[    1.186964]  [<ffffffff810a7982>] ? worker_thread+0x42/0x4c0
[    1.186964]  [<ffffffff810a7940>] ? process_one_work+0x490/0x490
[    1.186964]  [<ffffffff810acc19>] ? kthread+0xb9/0xd0
[    1.186964]  [<ffffffff810acb60>] ? kthread_park+0x50/0x50
[    1.186964]  [<ffffffff81087312>] ? ret_from_fork+0x22/0x30
[    1.186964] ---[ end trace aa2a2fb24b6f8b92 ]---

This is very early compared to where the freezes used to be, and now I don't get any freezes

Comment 17 yann 2016-11-02 09:27:20 UTC

Reference to Mika's patch set: https://patchwork.freedesktop.org/series/14620/

Comment 18 Mike Lothian 2016-11-02 09:38:13 UTC

Would you like me to test those two patches?

Comment 19 Artjom Simon 2016-11-02 23:38:34 UTC

I'm getting a similar WARN_ON_ONCE with current nightly with these patches seemingly already applied (GNU patch complains "Reversed (or previously applied) patch detected!" on both).

What can we do to debug this further?


[    2.140178] ------------[ cut here ]------------
[    2.140261] WARNING: CPU: 3 PID: 46 at drivers/gpu/drm/i915/intel_dp.c:4022 intel_dp_check_link_status+0x1d7/0x200 [i915]
[    2.140263] WARN_ON_ONCE(!intel_dp->lane_count)
[    2.140272] Modules linked in: i915 video button intel_gtt i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops drm
[    2.140276] CPU: 3 PID: 46 Comm: kworker/3:1 Not tainted 4.9.0-1-drm-intel-nightly #1
[    2.140278] Hardware name: Dell Inc. XPS 13 9350/07TYC2, BIOS 1.4.4 06/14/2016
[    2.140286] Workqueue: events output_poll_execute [drm_kms_helper]
[    2.140292]  ffffc90000e13bc8 ffffffff812f8de0 ffffc90000e13c18 0000000000000000
[    2.140295]  ffffc90000e13c08 ffffffff8107cddb 00000fb600e13c40 ffff8802760270f0
[    2.140299]  0000000000000001 ffff880273c58000 ffff880273c58258 ffff880276027000
[    2.140300] Call Trace:
[    2.140307]  [<ffffffff812f8de0>] dump_stack+0x63/0x83
[    2.140312]  [<ffffffff8107cddb>] __warn+0xcb/0xf0
[    2.140315]  [<ffffffff8107ce5f>] warn_slowpath_fmt+0x5f/0x80
[    2.140322]  [<ffffffffa007d467>] ? drm_dp_dpcd_read+0x57/0x70 [drm_kms_helper]
[    2.140386]  [<ffffffffa0172e67>] intel_dp_check_link_status+0x1d7/0x200 [i915]
[    2.140445]  [<ffffffffa01790b7>] intel_dp_detect+0x697/0xa40 [i915]
[    2.140452]  [<ffffffffa007e15f>] drm_helper_probe_single_connector_modes+0x3ff/0x4f0 [drm_kms_helper]
[    2.140459]  [<ffffffffa008d26b>] drm_fb_helper_hotplug_event+0x10b/0x150 [drm_kms_helper]
[    2.140516]  [<ffffffffa0169a14>] intel_fbdev_output_poll_changed+0x24/0x30 [i915]
[    2.140522]  [<ffffffffa007da27>] drm_kms_helper_hotplug_event+0x27/0x30 [drm_kms_helper]
[    2.140528]  [<ffffffffa007dc28>] output_poll_execute+0x198/0x1e0 [drm_kms_helper]
[    2.140533]  [<ffffffff81096b35>] process_one_work+0x1e5/0x470
[    2.140537]  [<ffffffff81096e08>] worker_thread+0x48/0x4e0
[    2.140541]  [<ffffffff81096dc0>] ? process_one_work+0x470/0x470
[    2.140544]  [<ffffffff81096dc0>] ? process_one_work+0x470/0x470
[    2.140547]  [<ffffffff8109c999>] kthread+0xd9/0xf0
[    2.140551]  [<ffffffff8102d752>] ? __switch_to+0x2d2/0x630
[    2.140553]  [<ffffffff8109c8c0>] ? kthread_park+0x60/0x60
[    2.140559]  [<ffffffff815fbdd5>] ret_from_fork+0x25/0x30
[    2.140562] ---[ end trace 4fe5657f076aec76 ]---
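
One way to check whether two dmesg traces report the same WARN site is to compare the source file and function from the WARNING line and ignore the line number, which differs between the two builds in this report (intel_dp.c:4023 and intel_dp.c:4022). The sketch below assumes the line layout seen in the traces in this report; the function names are hypothetical.

```python
import re

# Layout assumed from the traces in this report:
# "WARNING: CPU: N PID: N at <file>:<line> <function>+0x<off>/0x<size>"
WARNING_RE = re.compile(
    r"WARNING: CPU: \d+ PID: \d+ at (?P<file>\S+):(?P<line>\d+) "
    r"(?P<func>\w+)\+0x[0-9a-f]+/0x[0-9a-f]+"
)


def warning_site(dmesg_line):
    """Return (source file, line number, function), or None if no match."""
    match = WARNING_RE.search(dmesg_line)
    if match is None:
        return None
    return match["file"], int(match["line"]), match["func"]


def same_warning(a, b):
    """Compare file and function only; line numbers drift between builds."""
    site_a, site_b = warning_site(a), warning_site(b)
    if site_a is None or site_b is None:
        return False
    return (site_a[0], site_a[2]) == (site_b[0], site_b[2])
```

Matching sites only show that both machines hit the same check; they say nothing about whether the warning is connected to the hangs.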

Comment 20 Artjom Simon 2016-11-02 23:47:02 UTC

Created attachment 127706 [details]
full dmesg output on DELL XPS 13 after boot

Comment 21 Chris Wilson 2016-11-06 17:19:12 UTC

Fwiw, the warning is unrelated to the hangs, bug 98374. As the freezes are gone (hopefully still) and the patches are upstream, onwards.

commit 37c6393431bf526d6f465e095c1201c1b890dd51
Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
Date:   Tue Nov 1 15:27:36 2016 +0200

    drm/i915/gtt: Fix pte clear range

Comment 22 yann 2016-11-07 12:55:51 UTC

(In reply to Chris Wilson from comment #21)
> Fwiw, the warning is unrelated to the hangs, bug 98374. As the freezes are
> gone (hopefully still) and the patches are upstream, onwards.
> 
> commit 37c6393431bf526d6f465e095c1201c1b890dd51
> Author: Mika Kuoppala <mika.kuoppala@linux.intel.com>
> Date:   Tue Nov 1 15:27:36 2016 +0200
> 
>     drm/i915/gtt: Fix pte clear range

Closing as fixed (upstream + no hang)
