Summary: | [CI][DRMTIP] igt@gem_eio@in-flight-suspend - incomplete - GEM_BUG_ON(buf[2 * head + 1] != port->context_id) | ||
---|---|---|---|
Product: | DRI | Reporter: | Martin Peres <martin.peres> |
Component: | DRM/Intel | Assignee: | Joonas Lahtinen <joonas.lahtinen> |
Status: | CLOSED FIXED | QA Contact: | Intel GFX Bugs mailing list <intel-gfx-bugs> |
Severity: | normal | ||
Priority: | medium | CC: | intel-gfx-bugs |
Version: | XOrg git | ||
Hardware: | Other | ||
OS: | All | ||
Whiteboard: | ReadyForDev | ||
i915 platform: | KBL | i915 features: | GEM/Other |
Description
Martin Peres
2018-05-29 08:22:53 UTC
This looks like a regression introduced in drmtip_51, so bumping the priority as Linux 4.17 is about to be released.

(In reply to Martin Peres from comment #1)
> This looks like a regression introduced in drmtip_51, so bumping the
> priority as Linux 4.17 is about to be released.

tip is targeting 4.18. Iiuc, the trace can only be generated by gem_eio as it requires both simulating a suspend with TEST_DEVICES and disabling the GPU reset. It should be fixed by https://patchwork.freedesktop.org/patch/225442/ and if my reckoning is correct, we could have hit this since

commit ac697ae8013a7c7301174c9c3b02a92fe418b7ea
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu Mar 15 15:10:15 2018 +0000

    drm/i915: Stop engines when declaring the machine wedged

(In reply to Chris Wilson from comment #2)
> tip is targeting 4.18.

Sure, but the problem may be found in linus' tip and introduced here as a backmerge. If you can tell me that this is not the case, then we can lower the priority.

> It should be fixed by
> https://patchwork.freedesktop.org/patch/225442/

Thanks! Let's see :)

I applied

commit c3160da9a6af0e2d8f4fb3410df9d027a178ca3d
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Thu May 31 09:22:45 2018 +0100

    drm/i915: After reset on sanitization, reset the engine backends

    As we reset the GPU on suspend/resume, we also do need to reset the
    engine state tracking so call into the engine backends. This is
    especially important so that we can also sanitize the state tracking
    across resume.

    References: https://bugs.freedesktop.org/show_bug.cgi?id=106702
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20180531082246.9763-3-chris@chris-wilson.co.uk

which I claim to be sufficient to prevent this BUG().

(In reply to Chris Wilson from comment #4)
> which I claim to be sufficient to prevent this BUG().

Your claim did not hold up to reality as it is still happening at every single run... try again?

(In reply to Martin Peres from comment #5)
> Your claim did not hold up to reality as it is still happening at every
> single run... try again?

What are you talking about?

(In reply to Chris Wilson from comment #6)
> What are you talking about?

I meant that the patch apparently was not sufficient, as we still have this problem :s I'll provide you with links tomorrow if you need me to :)

(In reply to Martin Peres from comment #7)
> I meant that the patch apparently was not sufficient, as we still have this
> problem :s I'll provide you with links tomorrow if you need me to :)

Better late than never:
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_110/fi-kbl-r/igt@gem_eio@in-flight-suspend.html
https://intel-gfx-ci.01.org/tree/drm-tip/drmtip_110/fi-kbl-x1275/igt@gem_eio@in-flight-suspend.html

Also seen on ICL:
https://intel-gfx-ci.01.org/tree/drm-tip/IGTPW_1980/fi-icl-u/igt%40drv_selftest%40live_contexts.html

<3> [509.859079] process_csb:953 GEM_BUG_ON(buf[2 * head + 1] != port->context_id)
<4> [509.859207] ------------[ cut here ]------------
<2> [509.859209] kernel BUG at drivers/gpu/drm/i915/intel_lrc.c:953!
<4> [509.859217] invalid opcode: 0000 [#1] PREEMPT SMP PTI
<4> [509.859220] CPU: 2 PID: 4657 Comm: drv_selftest Tainted: G U W 4.19.0-rc8-CI-CI_DRM_5020+ #1
<4> [509.859222] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP, BIOS ICLSFWR1.R00.2392.A04.1809260455 09/26/2018
<4> [509.859283] RIP: 0010:process_csb+0x5c6/0x790 [i915]
<4> [509.859286] Code: 69 87 b9 e0 48 8b 35 99 f9 19 00 49 c7 c0 e0 7b 66 a0 b9 b9 03 00 00 48 c7 c2 10 fc 62 a0 48 c7 c7 e1 18 56 a0 e8 3a 17 c0 e0 <0f> 0b 48 c7 c1 28 9b 64 a0 ba bb 03 00 00 48 c7 c6 10 fc 62 a0 48
<4> [509.859288] RSP: 0018:ffff8804afe83e20 EFLAGS: 00010082
<4> [509.859291] RAX: 000000000000000e RBX: ffff8804aa16c2a8 RCX: 0000000000000000
<4> [509.859293] RDX: 0000000000000001 RSI: 0000000000000004 RDI: ffff8804ae250aa8
<4> [509.859295] RBP: ffff8804afe83e88 R08: 00000000009ccca1 R09: ffff8804ae3f4000
<4> [509.859297] R10: 0000000000000001 R11: ffff8804ae250aa8 R12: ffff8804a122c05c
<4> [509.859299] R13: 0000000000000003 R14: ffff880425d896c0 R15: ffff8804a122c040
<4> [509.859301] FS: 00007f53c3ea6980(0000) GS:ffff8804afe80000(0000) knlGS:0000000000000000
<4> [509.859303] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [509.859305] CR2: 00007f7f12bc0140 CR3: 000000047cc10005 CR4: 0000000000760ee0
<4> [509.859307] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4> [509.859309] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4> [509.859311] PKRU: 55555554
<4> [509.859312] Call Trace:
<4> [509.859315] <IRQ>
<4> [509.859360] __execlists_submission_tasklet+0x2c/0xc20 [i915]
<4> [509.859397] execlists_submission_tasklet+0x46/0x60 [i915]
<4> [509.859403] tasklet_action_common.isra.5+0x47/0xb0
<4> [509.859408] __do_softirq+0xd8/0x483
<4> [509.859412] ? _raw_spin_unlock+0x29/0x40
<4> [509.859415] irq_exit+0xa9/0xc0
<4> [509.859418] do_IRQ+0x9a/0x120
<4> [509.859422] common_interrupt+0xf/0xf
<4> [509.859424] </IRQ>
<4> [509.859427] RIP: 0010:_raw_spin_unlock_irqrestore+0x4e/0x60
<4> [509.859429] Code: c7 02 75 1f 53 9d e8 d1 28 82 ff bf 01 00 00 00 e8 27 17 77 ff 65 8b 05 e0 3b 6d 7e 85 c0 74 0c 5b 5d c3 e8 c4 26 82 ff 53 9d <eb> df e8 85 06 6c ff 5b 5d c3 0f 1f 84 00 00 00 00 00 53 48 8b 54
<4> [509.859431] RSP: 0018:ffffc90000357868 EFLAGS: 00000282 ORIG_RAX: ffffffffffffffde
<4> [509.859434] RAX: ffff8804610f4040 RBX: 0000000000000282 RCX: 0000000000000006
<4> [509.859436] RDX: 000000000000153b RSI: ffffffff8212508a RDI: ffffffff820d3a9f
<4> [509.859438] RBP: ffff8804a9d41c40 R08: 00000000efd5b9de R09: 0000000000000000
<4> [509.859440] R10: 0000000000000000 R11: 0000000000000000 R12: ffffea0010976200
<4> [509.859442] R13: 0000000000000001 R14: ffff880425d88000 R15: 0000000000000001
<4> [509.859449] free_debug_processing+0x27d/0x380
<4> [509.859489] ? i915_request_retire_upto+0xfd/0x150 [i915]
<4> [509.859493] __slab_free+0x33c/0x4f0
<4> [509.859496] ? _raw_spin_unlock_irqrestore+0x4c/0x60
<4> [509.859500] ? lockdep_hardirqs_on+0xe0/0x1b0
<4> [509.859503] ? _raw_spin_unlock_irqrestore+0x39/0x60
<4> [509.859507] ? debug_check_no_obj_freed+0x132/0x210
<4> [509.859541] ? i915_request_retire_upto+0xfd/0x150 [i915]
<4> [509.859544] ? kmem_cache_free+0x279/0x2e0
<4> [509.859547] kmem_cache_free+0x279/0x2e0
<4> [509.859577] i915_request_retire_upto+0xfd/0x150 [i915]
<4> [509.859607] i915_request_add+0x3ba/0x7e0 [i915]
<4> [509.859650] live_nop_switch+0x229/0x470 [i915]
<4> [509.859704] __i915_subtests+0x5e/0xf0 [i915]
<4> [509.859751] __run_selftests+0x10b/0x190 [i915]
<4> [509.859786] i915_live_selftests+0x2c/0x60 [i915]
<4> [509.859823] i915_pci_probe+0x50/0xa0 [i915]
<4> [509.859828] pci_device_probe+0xa1/0x130
<4> [509.859833] really_probe+0x25d/0x3c0
<4> [509.859836] driver_probe_device+0x10a/0x120
<4> [509.859840] __driver_attach+0xdb/0x100
<4> [509.859843] ? driver_probe_device+0x120/0x120
<4> [509.859845] bus_for_each_dev+0x74/0xc0
<4> [509.859849] bus_add_driver+0x15f/0x250
<4> [509.859851] ? 0xffffffffa0a0b000
<4> [509.859854] driver_register+0x56/0xe0
<4> [509.859857] ? 0xffffffffa0a0b000
<4> [509.859860] do_one_initcall+0x58/0x2e0
<4> [509.859863] ? rcu_lockdep_current_cpu_online+0x8f/0xd0
<4> [509.859866] ? do_init_module+0x1d/0x1ea
<4> [509.859870] ? rcu_read_lock_sched_held+0x6f/0x80
<4> [509.859873] ? kmem_cache_alloc_trace+0x264/0x290
<4> [509.859876] do_init_module+0x56/0x1ea
<4> [509.859882] load_module+0x26f5/0x29d0
<4> [509.859887] ? vfs_read+0x122/0x140
<4> [509.859893] ? __se_sys_finit_module+0xd3/0xf0
<4> [509.859896] __se_sys_finit_module+0xd3/0xf0
<4> [509.859902] do_syscall_64+0x55/0x190
<4> [509.859905] entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [509.859907] RIP: 0033:0x7f53c3770839
<4> [509.859910] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1f f6 2c 00 f7 d8 64 89 01 48
<4> [509.859912] RSP: 002b:00007ffd743b33e8 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
<4> [509.859915] RAX: ffffffffffffffda RBX: 000055e0b973eda0 RCX: 00007f53c3770839
<4> [509.859917] RDX: 0000000000000000 RSI: 000055e0b973fb40 RDI: 0000000000000006
<4> [509.859919] RBP: 000055e0b973fb40 R08: 0000000000000004 R09: 0000000000000000
<4> [509.859921] R10: 00007ffd743b3560 R11: 0000000000000246 R12: 0000000000000000
<4> [509.859923] R13: 000055e0b97387e0 R14: 0000000000000020 R15: 000000000000003c
<4> [509.859928] Modules linked in: i915(+) amdgpu chash gpu_sched ttm vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ax88179_178a usbnet x86_pkg_temp_thermal mii coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel snd_hda_codec snd_hwdep snd_hda_core e1000e snd_pcm prime_numbers [last unloaded: i915]
<0> [509.859952] Dumping ftrace buffer:
<0> [509.859954] ---------------------------------
[...]
<0> [509.882872] ---------------------------------
<4> [509.882876] ---[ end trace 96e50b0269c85436 ]---

Not the same. The chip not being reset across a PCI level suspend is not the same thing as what is happening to icl.

(In reply to Martin Peres from comment #5)
> Your claim did not hold up to reality as it is still happening at every
> single run... try again?

I missed that this was about the drmtip runs, hence the confusion.

commit 0eb6a3f7ef99e7de19efb1293be0571b1d4e83cd
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Fri Feb 8 15:37:04 2019 +0000

    drm/i915: Force the GPU reset upon wedging

    When declaring the GPU wedged, we do need to hit the GPU with the reset
    hammer so that its state matches our presumed state during cleanup. If
    the reset fails, it fails, and we may be unhappy but wedged. However, if
    we are testing our wedge/unwedged handling, the desync carries over into
    the next test and promptly explodes.

    References: https://bugs.freedesktop.org/show_bug.cgi?id=106702
    Signed-off-by: Chris Wilson <chris@chris-wilson.co.uk>
    Cc: Mika Kuoppala <mika.kuoppala@intel.com>
    Reviewed-by: Mika Kuoppala <mika.kuoppala@intel.com>
    Link: https://patchwork.freedesktop.org/patch/msgid/20190208153708.20023-3-chris@chris-wilson.co.uk

(In reply to Chris Wilson from comment #11)
> commit 0eb6a3f7ef99e7de19efb1293be0571b1d4e83cd
>
> drm/i915: Force the GPU reset upon wedging

Thanks! This definitely fixed the issue!

The CI Bug Log issue associated to this bug has been archived. New failures matching the above filters will not be associated to this bug anymore.
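
For readers unfamiliar with the assertion in the summary, here is a rough toy model in Python. It is not the i915 code. It assumes the buffer holds (status, context ID) pairs, which is what the index expression `buf[2 * head + 1]` suggests; the function name `csb_entry_matches`, the list `hw_buf`, and all ID values are hypothetical and chosen only for illustration.

```python
# Toy model of the check named in the bug summary; NOT the i915 implementation.
# Assumption: the status buffer stores (status, context_id) pairs, so entry
# `head` keeps its context ID at index 2 * head + 1.

def csb_entry_matches(buf, head, port_context_id):
    """Return True when entry `head` reports the context the driver expects."""
    return buf[2 * head + 1] == port_context_id


# Hypothetical values: the hardware last reported context 3 in entry 0.
hw_buf = [0x1, 3, 0x0, 0]

# Driver and hardware agree: the check passes.
in_sync = csb_entry_matches(hw_buf, head=0, port_context_id=3)

# The driver has reset its own tracking (it now expects context 7), but the
# hardware side was never reset, so the stale entry no longer matches.
# A mismatch of this kind is what the GEM_BUG_ON condition tests for.
desynced = csb_entry_matches(hw_buf, head=0, port_context_id=7)

print(in_sync, desynced)
```

In this model, resetting only one side leaves the two views inconsistent, which is the kind of desync the "Force the GPU reset upon wedging" commit message refers to.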