Bug 100232 - [BAT] IGT gem_exec_parallel hangs half of the time on BDW+ testhosts
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: DRI git
Hardware: Other All
Importance: medium normal
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
Reported: 2017-03-16 15:48 UTC by Tomi Sarvela
Modified: 2017-07-24 22:38 UTC (History)

i915 platform: BDW, BXT, KBL, SKL
i915 features: GEM/execlists


Attachments
dmesg from SKL CI_DRM_2352 (43.67 KB, text/x-log)
2017-03-16 15:55 UTC, Tomi Sarvela

Description Tomi Sarvela 2017-03-16 15:48:19 UTC
One of the subtests in gem_exec_parallel often hangs the host. Below is a dump from an SKL 6700K on a Z170 motherboard, which hung hard on igt@gem_exec_parallel@render-fds after running tests/intel-ci/fast-feedback.testlist.

CI_DRM_2352 is drm-tip, today's build. For details, see https://intel-gfx-ci.01.org/CI/

[  947.215802] general protection fault: 0000 [#1] PREEMPT SMP
[  947.221439] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_hdmi snd_hda_codec_realtek x86_pkg_temp_thermal snd_hda_codec_generic intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm mei_me mei e1000e igb ptp pps_core prime_numbers pinctrl_sunrisepoint pinctrl_intel i2c_hid [last unloaded: i915]
[  947.254918] CPU: 6 PID: 47 Comm: ksoftirqd/6 Tainted: G     U          4.11.0-rc2-CI-CI_DRM_2352+ #1
[  947.264181] Hardware name: Gigabyte Technology Co., Ltd. Z170X-UD5/Z170X-UD5-CF, BIOS F21 01/06/2017
[  947.273499] task: ffff88042bdaa7c0 task.stack: ffffc900001fc000
[  947.279489] RIP: 0010:notifier_call_chain+0x59/0xa0
[  947.284426] RSP: 0018:ffffc900001ffd38 EFLAGS: 00010286
[  947.289698] RAX: 0000000000000001 RBX: 00000000ffffffff RCX: 00000000ffffffff
[  947.296917] RDX: ffff8803bf65d5c0 RSI: 0000000000000001 RDI: ffff88041d05e4c8
[  947.304249] RBP: ffffc900001ffd70 R08: 0000000000000000 R09: 643e07b800000000
[  947.311544] R10: 0000000000000000 R11: ffff88042bdaa7c0 R12: 0000000000000000
[  947.318833] R13: 0000000000000000 R14: 00000000ffffffff R15: 6b6b6b6b6b6b6b6b
[  947.326070] FS:  0000000000000000(0000) GS:ffff88043ed80000(0000) knlGS:0000000000000000
[  947.334408] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  947.340242] CR2: 00007f83d8000010 CR3: 0000000429021000 CR4: 00000000003406e0
[  947.347512] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[  947.354801] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[  947.362080] Call Trace:
[  947.364582]  __atomic_notifier_call_chain+0x73/0x110
[  947.369709]  ? unregister_die_notifier+0x20/0x20
[  947.374406]  atomic_notifier_call_chain+0x11/0x20
[  947.379276]  intel_lrc_irq_handler+0x191/0x490 [i915]
[  947.384458]  tasklet_hi_action+0xf0/0x110
[  947.388611]  __do_softirq+0x116/0x4c0
[  947.392321]  run_ksoftirqd+0x22/0x50
[  947.395960]  smpboot_thread_fn+0x180/0x280
[  947.400129]  kthread+0x107/0x140
[  947.403431]  ? sort_range+0x20/0x20
[  947.407002]  ? kthread_create_on_node+0x40/0x40
[  947.411621]  ret_from_fork+0x2e/0x40
[  947.415233] Code: 4c 89 ff 41 ff 17 4d 85 e4 41 89 c5 74 05 41 83 04 24 01 41 f7 c5 00 80 00 00 75 39 83 eb 01 4d 89 f7 4d 85 ff 74 2e 85 db 74 2a <49> 8b 3f 4d 8b 77 08 e8 cb ca ff ff 85 c0 75 bd 48 c7 c2 04 f1
[  947.434475] RIP: notifier_call_chain+0x59/0xa0 RSP: ffffc900001ffd38
[  947.440931] ---[ end trace e6564010da93ee3e ]---
[  947.608936] Kernel panic - not syncing: Fatal exception in interrupt
[  947.615465] Kernel Offset: disabled
[  947.791838] ---[ end Kernel panic - not syncing: Fatal exception in interrupt
[  947.799101] ------------[ cut here ]------------
[  947.803805] WARNING: CPU: 6 PID: 47 at arch/x86/kernel/smp.c:127 native_smp_send_reschedule+0x3a/0x40
[  947.813181] Modules linked in: snd_hda_intel i915 vgem snd_hda_codec_hdmi snd_hda_codec_realtek x86_pkg_temp_thermal snd_hda_codec_generic intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm mei_me mei e1000e igb ptp pps_core prime_numbers pinctrl_sunrisepoint pinctrl_intel i2c_hid [last unloaded: i915]
[  947.846626] CPU: 6 PID: 47 Comm: ksoftirqd/6 Tainted: G     UD         4.11.0-rc2-CI-CI_DRM_2352+ #1
[  947.855898] Hardware name: Gigabyte Technology Co., Ltd. Z170X-UD5/Z170X-UD5-CF, BIOS F21 01/06/2017
[  947.865208] Call Trace:
[  947.867711]  <IRQ>
[  947.869768]  dump_stack+0x67/0x92
[  947.873129]  __warn+0xc6/0xe0
[  947.876145]  warn_slowpath_null+0x18/0x20
[  947.880218]  native_smp_send_reschedule+0x3a/0x40
[  947.885019]  trigger_load_balance+0x2cd/0x580
[  947.889448]  ? trigger_load_balance+0x6f/0x580
[  947.893956]  scheduler_tick+0x97/0xc0
[  947.897673]  ? tick_sched_handle.isra.7+0x40/0x40
[  947.902458]  update_process_times+0x42/0x50
[  947.906721]  tick_sched_handle.isra.7+0x1c/0x40
[  947.911356]  tick_sched_timer+0x3d/0x70
[  947.915249]  __hrtimer_run_queues+0xf3/0x530
[  947.919590]  hrtimer_interrupt+0xb9/0x210
[  947.923655]  local_apic_timer_interrupt+0x31/0x50
[  947.928449]  smp_apic_timer_interrupt+0x33/0x50
[  947.933084]  apic_timer_interrupt+0x90/0xa0
[  947.937330] RIP: 0010:panic+0x1c7/0x205
[  947.941231] RSP: 0018:ffffc900001ffb90 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff10
[  947.948909] RAX: 0000000000000041 RBX: 0000000000000000 RCX: 0000000000000000
[  947.956191] RDX: 0000000000000101 RSI: ffffffff81c6e65d RDI: ffffffff8117ef23
[  947.963410] RBP: ffffc900001ffc00 R08: 0000000000000001 R09: 0000000000000000
[  947.970664] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
[  947.977909] R13: 0000000000000000 R14: 0000000000000000 R15: 6b6b6b6b6b6b6b6b
[  947.985170]  </IRQ>
[  947.987320]  ? panic+0x1c4/0x205
[  947.990604]  ? kmsg_dump+0x11f/0x1c0
[  947.994237]  oops_end+0x78/0x90
[  947.997435]  die+0x46/0x60
[  948.000172]  do_general_protection+0xe0/0x1a0
[  948.004610]  general_protection+0x22/0x30
[  948.008683] RIP: 0010:notifier_call_chain+0x59/0xa0
[  948.013650] RSP: 0018:ffffc900001ffd38 EFLAGS: 00010286
[  948.018963] RAX: 0000000000000001 RBX: 00000000ffffffff RCX: 00000000ffffffff
[  948.026224] RDX: ffff8803bf65d5c0 RSI: 0000000000000001 RDI: ffff88041d05e4c8
[  948.033487] RBP: ffffc900001ffd70 R08: 0000000000000000 R09: 643e07b800000000
[  948.040759] R10: 0000000000000000 R11: ffff88042bdaa7c0 R12: 0000000000000000
[  948.048020] R13: 0000000000000000 R14: 00000000ffffffff R15: 6b6b6b6b6b6b6b6b
[  948.055277]  __atomic_notifier_call_chain+0x73/0x110
[  948.060328]  ? unregister_die_notifier+0x20/0x20
[  948.065034]  atomic_notifier_call_chain+0x11/0x20
[  948.069835]  intel_lrc_irq_handler+0x191/0x490 [i915]
[  948.074960]  tasklet_hi_action+0xf0/0x110
[  948.079023]  __do_softirq+0x116/0x4c0
[  948.082766]  run_ksoftirqd+0x22/0x50
[  948.086406]  smpboot_thread_fn+0x180/0x280
[  948.090593]  kthread+0x107/0x140
[  948.093867]  ? sort_range+0x20/0x20
[  948.097421]  ? kthread_create_on_node+0x40/0x40
[  948.102041]  ret_from_fork+0x2e/0x40
[  948.105681] ---[ end trace e6564010da93ee3f ]---
Comment 1 Tomi Sarvela 2017-03-16 15:55:56 UTC
Created attachment 130263
dmesg from SKL CI_DRM_2352

Added dmesg from boot
Comment 2 Chris Wilson 2017-03-16 16:32:19 UTC
commit 3fc03069bc6e6c316f19bb526e3c8ce784677477
Author: Changbin Du <changbin.du@intel.com>
Date:   Mon Mar 13 10:47:11 2017 +0800

    drm/i915: make context status notifier head be per engine
    
    GVTg has introduced the context status notifier to schedule the GVTg
    workload. At that time, the notifier is bound to GVTg context only,
    so GVTg is not aware of host workloads.
    
    Now we are going to improve GVTg's guest workload scheduler policy,
    and add Guc emulation support for new Gen graphics. Both these two
    features require acknowledgment for all contexts running on hardware.
    (But will not alter host workload.) So here try to make some change.
    
    The change is simple:
      1. Move the context status notifier head from i915_gem_context to
         intel_engine_cs. Which means there is a notifier head per engine
         instead of per context. Execlist driver still call notifier for
         each context sched-in/out events of current engine.
      2. At GVTg side, it binds a notifier_block for each physical engine
         at GVTg initialization period. Then GVTg can hear all context
         status events.
    
    In this patch, GVTg do nothing for host context event, but later
    will add a function there. But in any case, the notifier callback is
    a noop if this is no active vGPU.
    
    Since intel_gvt_init() is called at early initialization stage and
    require the status notifier head has been initiated, I initiate it in
    intel_engine_setup().
    
    v2: remove a redundant newline. (chris)
    
    Fixes: 3c7ba6359d70 ("drm/i915: Introduce execlist context status change notification")
    Bugzilla: https://bugs.freedesktop.org/show_bug.cgi?id=100232
    Signed-off-by: Changbin Du <changbin.du@intel.com>
    Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com>
    Cc: Zhi Wang <zhi.a.wang@intel.com>
    Reviewed-by: Chris Wilson <chris@chris-wilson.co.uk>
    Link: http://patchwork.freedesktop.org/patch/msgid/20170313024711.28591-1-changbin.du@intel.com
    Acked-by: Zhenyu Wang <zhenyuw@linux.intel.com>
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
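
For readers following the commit message above, here is a minimal userspace sketch in C of the per-engine notifier layout it describes. It is not the i915 code: `struct engine`, `struct context`, `struct notifier`, `engine_register_notifier()`, `engine_notify()`, and `count_event()` are hypothetical stand-ins, and the kernel's atomic notifier chain is replaced here by a plain singly linked list with no locking.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for illustration only; not the i915 definitions. */
enum ctx_status { CTX_SCHEDULE_IN, CTX_SCHEDULE_OUT };

struct context {
    int id;                 /* the context no longer owns a notifier head */
};

struct notifier {
    int (*call)(struct notifier *nb, enum ctx_status status,
                struct context *ctx);
    struct notifier *next;
};

struct engine {
    const char *name;
    struct notifier *status_notifiers;  /* one notifier head per engine */
};

/* Bind a listener to a single engine (assumed: no locking, no unregister). */
static void engine_register_notifier(struct engine *e, struct notifier *nb)
{
    assert(nb != NULL && nb->call != NULL);
    nb->next = e->status_notifiers;
    e->status_notifiers = nb;
}

/*
 * Report a schedule-in/out event for any context running on this engine.
 * Returns the number of listeners that were called.
 */
static int engine_notify(struct engine *e, enum ctx_status status,
                         struct context *ctx)
{
    int calls = 0;

    for (struct notifier *nb = e->status_notifiers; nb != NULL; nb = nb->next) {
        nb->call(nb, status, ctx);
        calls++;
    }
    return calls;
}

/* Example listener: counts every event it hears, whatever the context. */
static int events_seen;

static int count_event(struct notifier *nb, enum ctx_status status,
                       struct context *ctx)
{
    (void)nb;
    (void)status;
    (void)ctx;
    events_seen++;
    return 0;
}
```

Registering one `struct notifier` on an engine and calling `engine_notify()` for two different contexts invokes the callback twice, while an engine with no registered listener invokes nothing. That is the property the commit message relies on: a listener bound per engine hears the events of every context scheduled on that engine.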
Comment 3 Chris Wilson 2017-03-17 15:30:53 UTC
*** Bug 100253 has been marked as a duplicate of this bug. ***