Bug 101231

Summary: [BDW] igt@kms_busy@extended-pageflip-hang-newfb-default-A hard hang
Product: DRI
Reporter: Marta Löfstedt <marta.lofstedt>
Component: DRM/Intel
Assignee: Marta Löfstedt <marta.lofstedt>
Status: CLOSED FIXED
QA Contact: Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity: normal
Priority: medium
CC: intel-gfx-bugs
Version: DRI git
Hardware: Other
OS: All
Whiteboard: ReadyForDev
i915 platform: BDW
i915 features: display/Other

Description Marta Löfstedt 2017-05-30 10:32:45 UTC
./tests/kms_busy --run-subtest extended-pageflip-hang-newfb-default-A

drm-tip:
commit afd6d4ac7d9c06649fe00300dd608bccc189ca22
Author: Daniel Vetter <daniel.vetter@ffwll.ch>
Date:   Mon May 29 21:13:32 2017 +0200

    drm-tip: 2017y-05m-29d-19h-12m-29s UTC integration manifest

igt:
IGT-Version: 1.18-g00ce341 (x86_64) (Linux: 4.12.0-rc2+ x86_64)


[ 2176.095496] INFO: task kworker/1:0:17 blocked for more than 120 seconds.
[ 2176.095502]       Tainted: G        W       4.12.0-rc2+ #5
[ 2176.095504] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 2176.095507] kworker/1:0     D    0    17      2 0x00000000
[ 2176.095562] Workqueue: events_long i915_hangcheck_elapsed [i915]
[ 2176.095564] Call Trace:
[ 2176.095571]  __schedule+0x3c4/0x820
[ 2176.095574]  schedule+0x36/0x80
[ 2176.095577]  schedule_preempt_disabled+0xe/0x10
[ 2176.095580]  __ww_mutex_lock.isra.6+0x3de/0x6b0
[ 2176.095583]  __ww_mutex_lock_slowpath+0x16/0x20
[ 2176.095586]  ? __ww_mutex_lock_slowpath+0x16/0x20
[ 2176.095588]  ww_mutex_lock+0x37/0x70
[ 2176.095604]  modeset_backoff+0x4b/0xc0 [drm]
[ 2176.095616]  drm_modeset_backoff+0x10/0x20 [drm]
[ 2176.095649]  intel_prepare_reset+0x38/0xe0 [i915]
[ 2176.095670]  i915_reset_and_wakeup+0xad/0x190 [i915]
[ 2176.095692]  i915_handle_error+0x1df/0x220 [i915]
[ 2176.095695]  ? scnprintf+0x49/0x80
[ 2176.095724]  hangcheck_declare_hang+0xe2/0x110 [i915]
[ 2176.095753]  ? gen6_read32+0x9f/0x1c0 [i915]
[ 2176.095782]  i915_hangcheck_elapsed+0x29f/0x2d0 [i915]
[ 2176.095785]  process_one_work+0x1e9/0x3f0
[ 2176.095787]  worker_thread+0x4b/0x410
[ 2176.095789]  kthread+0x109/0x140
[ 2176.095790]  ? process_one_work+0x3f0/0x3f0
[ 2176.095793]  ? kthread_create_on_node+0x70/0x70
[ 2176.095796]  ret_from_fork+0x2c/0x40
[ 2176.095799] NMI backtrace for cpu 0
[ 2176.095801] CPU: 0 PID: 36 Comm: khungtaskd Tainted: G        W       4.12.0-rc2+ #5
[ 2176.095802] Hardware name:                  /NUC5i5RYB, BIOS RYBDWi35.86A.0249.2015.0529.1640 05/29/2015
[ 2176.095803] Call Trace:
[ 2176.095805]  dump_stack+0x63/0x8d
[ 2176.095808]  nmi_cpu_backtrace+0x94/0xa0
[ 2176.095811]  ? irq_force_complete_move+0x150/0x150
[ 2176.095814]  nmi_trigger_cpumask_backtrace+0xff/0x130
[ 2176.095816]  arch_trigger_cpumask_backtrace+0x19/0x20
[ 2176.095818]  watchdog+0x2d8/0x360
[ 2176.095821]  kthread+0x109/0x140
[ 2176.095823]  ? reset_hung_task_detector+0x20/0x20
[ 2176.095825]  ? kthread_create_on_node+0x70/0x70
[ 2176.095827]  ret_from_fork+0x2c/0x40
[ 2176.095830] Sending NMI from CPU 0 to CPUs 1-3:
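
For comparing reports like the one above across runs, the hung-task header and the call chain of the blocked task can be pulled out of a dmesg capture with a short script. The following is only a minimal sketch: the names `parse_hung_tasks`, `HUNG_RE`, `FRAME_RE` and `SAMPLE` are hypothetical, and the regular expressions assume the exact line layout seen in this log. Frames prefixed with `?` are ones the kernel marks as unreliable, so the sketch skips them.

```python
import re

# Hung-task header, e.g.
# "[ 2176.095496] INFO: task kworker/1:0:17 blocked for more than 120 seconds."
HUNG_RE = re.compile(
    r"INFO: task (?P<comm>\S+):(?P<pid>\d+) "
    r"blocked for more than (?P<secs>\d+) seconds"
)

# One stack frame, e.g. "[ 2176.095649]  intel_prepare_reset+0x38/0xe0 [i915]".
# A leading "? " marks a frame the kernel considers unreliable.
FRAME_RE = re.compile(
    r"^\[\s*\d+\.\d+\]\s+(?P<unreliable>\? )?"
    r"(?P<sym>[\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+"
    r"(?: \[(?P<mod>\w+)\])?\s*$"
)


def parse_hung_tasks(log_text):
    """Return one dict per hung-task report found in log_text."""
    reports = []
    current = None      # report currently being filled in, if any
    in_trace = False    # True once "Call Trace:" was seen for that report
    for line in log_text.splitlines():
        header = HUNG_RE.search(line)
        if header:
            current = {
                "comm": header["comm"],
                "pid": int(header["pid"]),
                "timeout_s": int(header["secs"]),
                "frames": [],
            }
            reports.append(current)
            in_trace = False
            continue
        if current is None:
            continue
        if "Call Trace:" in line:
            in_trace = True
            continue
        if in_trace:
            frame = FRAME_RE.match(line)
            if frame is None:
                # The first non-frame line ends the blocked task's trace
                # (for example the "NMI backtrace" line that follows it).
                current = None
                in_trace = False
            elif not frame["unreliable"]:
                current["frames"].append((frame["sym"], frame["mod"]))
    return reports


# A few lines copied from the log above, used as sample input.
SAMPLE = """\
[ 2176.095496] INFO: task kworker/1:0:17 blocked for more than 120 seconds.
[ 2176.095564] Call Trace:
[ 2176.095571]  __schedule+0x3c4/0x820
[ 2176.095586]  ? __ww_mutex_lock_slowpath+0x16/0x20
[ 2176.095588]  ww_mutex_lock+0x37/0x70
[ 2176.095616]  drm_modeset_backoff+0x10/0x20 [drm]
[ 2176.095649]  intel_prepare_reset+0x38/0xe0 [i915]
[ 2176.095799] NMI backtrace for cpu 0
"""

if __name__ == "__main__":
    for report in parse_hung_tasks(SAMPLE):
        print(report["comm"], report["pid"], report["timeout_s"])
        for sym, mod in report["frames"]:
            print("   ", sym, f"[{mod}]" if mod else "")
```

Applied to both traces in this bug, such a script would make it easy to confirm that the blocked path (i915_hangcheck_elapsed down to ww_mutex_lock via drm_modeset_backoff) is the same in each report.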
Comment 1 Marta Löfstedt 2017-06-14 07:55:52 UTC
Reproduced on HSW Harrisbeach.
drm-tip:
commit d4bedb8b0f9ba91df2e8cb136a489145a83e96a7
Author: Sean Paul <seanpaul@chromium.org>
Date:   Tue Jun 13 10:23:36 2017 -0400

    drm-tip: 2017y-06m-13d-14h-22m-46s UTC integration manifest
Comment 2 Marta Löfstedt 2017-06-15 07:56:16 UTC
Maarten believes this is a duplicate of bug 99093.
I can reproduce this hard hang 100% of the time on the BDW NUCi5 and the Harrisbeach HSW.

*** This bug has been marked as a duplicate of bug 99093 ***
Comment 3 Marta Löfstedt 2017-06-26 09:28:23 UTC
Chris now has a workaround for the PNV hang tracked in bug 99093 (https://bugs.freedesktop.org/show_bug.cgi?id=99093), but it doesn't fix this issue. I will un-duplicate this bug and continue working on it.
Comment 4 Marta Löfstedt 2017-06-26 10:19:44 UTC
Tested some of the kms_busy --run-subtest extended-* subtests with Ville's branch:
git://github.com/vsyrjala/linux.git no_commit_reordering

Still hit a deadlock:
[ 1450.965818] INFO: task kworker/2:3:2643 blocked for more than 120 seconds.
[ 1450.965827]       Tainted: G     U          4.12.0-rc4+ #2
[ 1450.965830] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1450.965833] kworker/2:3     D    0  2643      2 0x00000000
[ 1450.965924] Workqueue: events_long i915_hangcheck_elapsed [i915]
[ 1450.965927] Call Trace:
[ 1450.965939]  __schedule+0x3c4/0x820
[ 1450.965946]  schedule+0x36/0x80
[ 1450.965951]  schedule_preempt_disabled+0xe/0x10
[ 1450.965956]  __ww_mutex_lock.isra.6+0x3de/0x6b0
[ 1450.965963]  __ww_mutex_lock_slowpath+0x16/0x20
[ 1450.965968]  ? __ww_mutex_lock_slowpath+0x16/0x20
[ 1450.965973]  ww_mutex_lock+0x37/0x70
[ 1450.966009]  modeset_backoff+0x4b/0xc0 [drm]
[ 1450.966039]  drm_modeset_backoff+0x10/0x20 [drm]
[ 1450.966118]  intel_prepare_reset+0x4e/0xe0 [i915]
[ 1450.966171]  i915_reset_and_wakeup+0xad/0x190 [i915]
[ 1450.966224]  i915_handle_error+0x1df/0x220 [i915]
[ 1450.966229]  ? scnprintf+0x49/0x80
[ 1450.966298]  hangcheck_declare_hang+0xe2/0x110 [i915]
[ 1450.966363]  ? gen6_read32+0x9f/0x1c0 [i915]
[ 1450.966424]  i915_hangcheck_elapsed+0x29f/0x2d0 [i915]
[ 1450.966430]  process_one_work+0x1e9/0x3f0
[ 1450.966433]  worker_thread+0x4b/0x410
[ 1450.966439]  kthread+0x109/0x140
[ 1450.966443]  ? process_one_work+0x3f0/0x3f0
[ 1450.966448]  ? kthread_create_on_node+0x70/0x70
[ 1450.966452]  ret_from_fork+0x25/0x30
[ 1450.966456] NMI backtrace for cpu 0
[ 1450.966461] CPU: 0 PID: 36 Comm: khungtaskd Tainted: G     U          4.12.0-rc4+ #2
[ 1450.966463] Hardware name:                  /NUC5i5RYB, BIOS RYBDWi35.86A.0249.2015.0529.1640 05/29/2015
[ 1450.966465] Call Trace:
[ 1450.966470]  dump_stack+0x63/0x8d
[ 1450.966475]  nmi_cpu_backtrace+0x94/0xa0
[ 1450.966479]  ? irq_force_complete_move+0x140/0x140
[ 1450.966484]  nmi_trigger_cpumask_backtrace+0xde/0x110
Comment 5 Marta Löfstedt 2017-06-28 12:43:12 UTC
Note that Maarten now has an IGT patch with which the bad behavior of the igt@kms_busy@extended-* tests no longer appears to be reproducible:

commit 9120ee572f86ed6f3da088187043785d1da340c9
Author: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Date:   Wed Jun 28 12:49:35 2017 +0200

    tests/kms_busy: Remove gem_bo_busy checks


On the other hand, Ville is still working on the real fix in the driver.
Comment 6 Marta Löfstedt 2017-06-29 06:27:03 UTC
I was too optimistic about the impact of Maarten's IGT patch. After running all kms tests overnight, the same issue was reproduced.
Comment 7 Marta Löfstedt 2017-08-02 06:21:25 UTC
I can't reproduce this issue on BDW drm-tip 4.13.0-rc2+ git@f9cb5a18. However, the tests are now skipped.
Comment 8 Marta Löfstedt 2017-08-08 06:57:52 UTC
I can no longer reproduce this issue.
Comment 9 Maarten Lankhorst 2017-08-14 13:34:13 UTC
Most likely fixed by the following commit:

commit c4bbb7358425b3a1c93f4d8c12a8fe46438ca73b
Author: Maarten Lankhorst <maarten.lankhorst@linux.intel.com>
Date:   Tue Jul 11 16:33:14 2017 +0200

    drm/atomic: Allow drm_atomic_helper_swap_state to fail
