Bug 106523 - [CI][IGT] DRM-Tip 4.17-rc5 pull made igt@gem_eio@ tests unstable
Status: CLOSED FIXED
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: unspecified
Hardware: x86-64 (AMD64) Linux (All)
Importance: high major
Assignee: Intel GFX Bugs mailing list
QA Contact: Intel GFX Bugs mailing list
Reported: 2018-05-15 07:00 UTC by Tomi Sarvela
Modified: 2018-05-22 20:30 UTC
i915 platform: ALL

Description Tomi Sarvela 2018-05-15 07:00:17 UTC
Last night's 4.17-rc5 pull made the following tests unstable on SNB, HSW, APL, KBL and GLK:
igt@gem_eio@hibernate
igt@gem_eio@in-flight-10ms
igt@gem_eio@in-flight-1us
igt@gem_eio@in-flight-contexts-10ms
igt@gem_eio@in-flight-contexts-1us
igt@gem_eio@in-flight-contexts-immediate
igt@gem_eio@in-flight-external
igt@gem_eio@in-flight-immediate
igt@gem_eio@in-flight-internal-10ms
igt@gem_eio@in-flight-internal-1us
igt@gem_eio@in-flight-internal-immediate
igt@gem_eio@in-flight-suspend
igt@gem_eio@suspend
igt@gem_eio@throttle
igt@gem_eio@unwedge-stress
igt@gem_eio@wait-wedge-10ms
igt@gem_eio@wait-wedge-1us
igt@gem_eio@wait-wedge-immediate

and as an extra,

igt@gem_exec_await@wide-contexts

The kernel splat looks like:

[   87.682609] Setting dangerous option reset - tainting kernel
[   87.685388] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
[   87.688073] Setting dangerous option reset - tainting kernel
[   87.691389] Setting dangerous option reset - tainting kernel
[   87.704931] i915 0000:00:02.0: Resetting chip for Manually set wedged engine mask = ffffffffffffffff
[   87.705581] WARNING: CPU: 2 PID: 1377 at kernel/kthread.c:505 kthread_park+0x55/0x60
[   87.705583] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic i915 x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_intel snd_hda_codec snd_hwdep snd_hda_core snd_pcm broadcom bcm_phy_lib tg3 mei_me prime_numbers mei lpc_ich
[   87.705618] CPU: 2 PID: 1377 Comm: gem_eio Tainted: G     U            4.17.0-rc5-CI-CI_DRM_4177+ #1
[   87.705620] Hardware name: Dell Inc. XPS 8300  /0Y2MRG, BIOS A06 10/17/2011
[   87.705622] RIP: 0010:kthread_park+0x55/0x60
[   87.705624] RSP: 0018:ffffc9000051bac0 EFLAGS: 00010202
[   87.705627] RAX: 0000000000000004 RBX: ffff88021ca13de8 RCX: 0000000000000001
[   87.705629] RDX: 0000000080000001 RSI: ffffffff821228a9 RDI: ffff88020e8f0040
[   87.705630] RBP: ffff880215937670 R08: 00000000bae32d65 R09: 0000000000000000
[   87.705632] R10: 0000000000000000 R11: 0000000000000000 R12: ffff8802159376b0
[   87.705634] R13: ffff880215937670 R14: ffff880215930000 R15: ffffffffa01c8d60
[   87.705636] FS:  00007f0c32061980(0000) GS:ffff88022fa80000(0000) knlGS:0000000000000000
[   87.705637] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   87.705639] CR2: 00007f0c32094000 CR3: 000000021a0d4004 CR4: 00000000000606e0
[   87.705641] Call Trace:
[   87.705668]  i915_gem_reset_prepare_engine+0x1d/0xa0 [i915]
[   87.705694]  i915_gem_set_wedged+0x7b/0x1e0 [i915]
[   87.705699]  ? __drm_printfn_info+0x20/0x20
[   87.705722]  i915_reset+0x14a/0x290 [i915]
[   87.705743]  i915_reset_device+0x1fb/0x290 [i915]
[   87.705767]  ? __intel_get_crtc_scanline+0x1c0/0x1c0 [i915]
[   87.705772]  ? work_on_cpu_safe+0x50/0x50
[   87.705798]  i915_handle_error+0x207/0x4a0 [i915]
[   87.705810]  ? __might_fault+0x39/0x90
[   87.705835]  i915_wedged_set+0x7f/0xc0 [i915]
[   87.705841]  simple_attr_write+0xb0/0xd0
[   87.705847]  full_proxy_write+0x51/0x80
[   87.705852]  __vfs_write+0x31/0x160
[   87.705857]  ? rcu_read_lock_sched_held+0x6f/0x80
[   87.705860]  ? rcu_sync_lockdep_assert+0x29/0x50
[   87.705862]  ? __sb_start_write+0x152/0x1f0
[   87.705864]  ? __sb_start_write+0x168/0x1f0
[   87.705868]  vfs_write+0xbd/0x1a0
[   87.705872]  ksys_write+0x50/0xc0
[   87.705877]  do_syscall_64+0x55/0x190
[   87.705880]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
[   87.705882] RIP: 0033:0x7f0c315df281
[   87.705884] RSP: 002b:00007ffc9c990328 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[   87.705887] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f0c315df281
[   87.705889] RDX: 0000000000000002 RSI: 000055a5e23ef276 RDI: 0000000000000047
[   87.705890] RBP: 00007ffc9c990350 R08: 0000000000000000 R09: 0000000000000034
[   87.705892] R10: 0000000000000000 R11: 0000000000000246 R12: 000055a5e23ebc50
[   87.705894] R13: 00007ffc9c990dc0 R14: 0000000000000000 R15: 0000000000000000
[   87.705902] Code: 00 31 ed 48 39 c7 74 0e e8 79 db 00 00 48 8d 7b 18 e8 a0 05 88 00 89 e8 5b 5d c3 0f 0b bd da ff ff ff 89 e8 5b 5d c3 0f 0b eb b7 <0f> 0b bd f0 ff ff ff eb e2 66 90 41 57 41 56 49 c7 c6 f4 ff ff 
[   87.706041] irq event stamp: 74402
[   87.706046] hardirqs last  enabled at (74401): [<ffffffff8192faac>] _raw_spin_unlock_irqrestore+0x4c/0x60
[   87.706050] hardirqs last disabled at (74402): [<ffffffff81a0111c>] error_entry+0x7c/0x100
[   87.706054] softirqs last  enabled at (73764): [<ffffffff817e49fc>] peernet2id+0x4c/0x70
[   87.706057] softirqs last disabled at (73762): [<ffffffff817e49dd>] peernet2id+0x2d/0x70
[   87.706061] WARNING: CPU: 2 PID: 1377 at kernel/kthread.c:505 kthread_park+0x55/0x60
[   87.706064] ---[ end trace 44f76719391c86af ]---

Full traces are available at:

https://intel-gfx-ci.01.org/tree/drm-tip/shards.html
Comment 1 Chris Wilson 2018-05-17 04:43:12 UTC
commit 3f6e9822308127104a7bb007ca569f2c57d03b67
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Wed May 16 19:33:55 2018 +0100

    drm/i915: Stop parking the signaler around reset
    
    We cannot call kthread_park() from softirq context, so let's avoid it
    entirely during the reset. We wanted to suspend the signaler so that it
    would not mark a request as complete at the same time as we marked it as
    being in error. Instead of parking the signaling, stop the engine from
    advancing so that the GPU doesn't emit the breadcrumb for our chosen
    "guilty" request.
    
    v2: Refactor setting STOP_RING so that we don't have the same code thrice
    
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
    Cc: Michałt Winiarski <michal.winiarski@intel.com>
    CC: Michel Thierry <michel.thierry@intel.com>
    Cc: Jeff McGee <jeff.mcgee@intel.com>
    Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180516183355.10553-8-chris@chris-wilson.co.uk
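
For readers unfamiliar with the race described in the commit message, here is a toy userspace model of the idea. It is only a sketch and not the i915 code: `ToyEngine`, `Request`, `signaler_pass`, `reset` and `run` are hypothetical names invented for illustration, and the `stopped` flag merely stands in for the STOP_RING behaviour the commit mentions. The model assumes the signaler completes a request as soon as the engine's breadcrumb reaches that request's seqno.

```python
from dataclasses import dataclass


@dataclass
class Request:
    seqno: int
    completed: bool = False
    error: bool = False


class ToyEngine:
    """Hypothetical stand-in for a GPU engine; not an i915 structure."""

    def __init__(self):
        self.breadcrumb = 0   # last seqno the "GPU" has written
        self.stopped = False  # models stopping the ring before a reset

    def advance(self):
        # The engine only emits the next breadcrumb while it is not stopped.
        if not self.stopped:
            self.breadcrumb += 1


def signaler_pass(engine, requests):
    # The signaler marks a request complete once its breadcrumb is visible.
    for rq in requests:
        if rq.seqno <= engine.breadcrumb:
            rq.completed = True


def reset(engine, requests, stop_engine):
    """Pick the first incomplete request as guilty and mark it in error."""
    if stop_engine:
        engine.stopped = True
    guilty = next(rq for rq in requests if not rq.completed)
    # Window during the reset: unless the engine was stopped, it may still
    # advance and the signaler may still observe the new breadcrumb.
    engine.advance()
    signaler_pass(engine, requests)
    guilty.error = True
    return guilty


def run(stop_engine):
    engine = ToyEngine()
    requests = [Request(1), Request(2)]
    engine.advance()                 # request 1 finishes normally
    signaler_pass(engine, requests)
    return reset(engine, requests, stop_engine)


if __name__ == "__main__":
    for stop in (False, True):
        guilty = run(stop)
        print(f"stop_engine={stop}: completed={guilty.completed} error={guilty.error}")
```

In this model, running the reset without stopping the engine leaves the guilty request flagged as both complete and in error, which is the conflict the commit message describes. Stopping the engine first avoids that without parking any thread.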
Comment 2 Martin Peres 2018-05-22 20:30:21 UTC
(In reply to Chris Wilson from comment #1)
> commit 3f6e9822308127104a7bb007ca569f2c57d03b67
> Author: Chris Wilson <chris@chris-wilson.co.uk>
> Date:   Wed May 16 19:33:55 2018 +0100
> 
>     drm/i915: Stop parking the signaler around reset
>     
>     We cannot call kthread_park() from softirq context, so let's avoid it
>     entirely during the reset. We wanted to suspend the signaler so that it
>     would not mark a request as complete at the same time as we marked it as
>     being in error. Instead of parking the signaling, stop the engine from
>     advancing so that the GPU doesn't emit the breadcrumb for our chosen
>     "guilty" request.
>     
>     v2: Refactor setting STOP_RING so that we don't have the same code thrice
>     
>     Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
>     Cc: Mika Kuoppala <mika.kuoppala@linux.intel.com>
>     Cc: Michałt Winiarski <michal.winiarski@intel.com>
>     CC: Michel Thierry <michel.thierry@intel.com>
>     Cc: Jeff McGee <jeff.mcgee@intel.com>
>     Cc: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>     Reviewed-by: Tvrtko Ursulin <tvrtko.ursulin@intel.com>
>     Link: https://patchwork.freedesktop.org/patch/msgid/20180516183355.10553-8-chris@chris-wilson.co.uk

That fixed it, thanks!

