HW: 00:02.0 0300: 8086:5916 (rev 02) Steps to reproduce: while ./gem_ctx_switch --r blt-interruptible ; do date ; done ~1% occurrence Dmesg: [13873.317628] [IGT] gem_ctx_switch: executing [13873.325119] [IGT] gem_ctx_switch: starting subtest blt-interruptible [14008.318651] ------------[ cut here ]------------ [14008.318653] kernel BUG at drivers/gpu/drm/i915/i915_gem_request.c:323! [14008.318663] invalid opcode: 0000 [#1] SMP [14008.318668] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul snd_hda_intel snd_hda_codec ghash_clmulni_intel snd_hwdep snd_hda_core snd_pcm mei_me mei i915 r8169 mii prime_numbers i2c_hid [14008.318707] CPU: 3 PID: 2066 Comm: gem_ctx_switch Tainted: G U 4.12.0-rc7-svmtest+ #150 [14008.318715] Hardware name: Acer Aspire E5-575G/Ironman_SK , BIOS V1.12 08/02/2016 [14008.318723] task: ffff8801ddc1b200 task.stack: ffffc90001ec0000 [14008.318754] RIP: 0010:i915_gem_request_retire+0x4a7/0x550 [i915] [14008.318760] RSP: 0018:ffffc90001ec3b00 EFLAGS: 00010283 [14008.318765] RAX: 000000002fe501f7 RBX: ffff8801e1503600 RCX: 000000002fe501f7 [14008.318771] RDX: 000000002fe501b8 RSI: ffff8801dd97b800 RDI: ffff8801e1503600 [14008.318778] RBP: ffffc90001ec3b30 R08: 0000000000000000 R09: 0000000000000000 [14008.318784] R10: ffffc90001ec3b28 R11: ffffffffa0082016 R12: ffff8801dde88000 [14008.318791] R13: ffff8801dc240000 R14: ffff8801dde88000 R15: ffff8801d9a8e000 [14008.318797] FS: 00007f5c3c17a2c0(0000) GS:ffff8801e8400000(0000) knlGS:0000000000000000 [14008.318805] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [14008.318811] CR2: 000055fa1b9c5028 CR3: 00000001dbb93000 CR4: 00000000003406e0 [14008.318818] Call Trace: [14008.318841] i915_gem_request_alloc+0xbe/0x400 [i915] [14008.318864] i915_gem_do_execbuffer+0x52b/0x11f0 [i915] [14008.318874] ? lock_acquire+0xb0/0x1e0 [14008.318880] ? lock_acquire+0xb0/0x1e0 [14008.318902] i915_gem_execbuffer2+0xbe/0x1d0 [i915] [14008.318909] drm_ioctl+0x1f3/0x470 [14008.318928] ? i915_gem_execbuffer+0x2d0/0x2d0 [i915] [14008.318937] ? trace_hardirqs_on+0xd/0x10 [14008.318944] do_vfs_ioctl+0x8f/0x6d0 [14008.318951] SyS_ioctl+0x3c/0x70 [14008.318957] entry_SYSCALL_64_fastpath+0x1c/0xb1 [14008.318962] RIP: 0033:0x7f5c3a976587 [14008.318967] RSP: 002b:00007ffc27e6fa28 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 [14008.318976] RAX: ffffffffffffffda RBX: 00007ffc27e6fa40 RCX: 00007f5c3a976587 [14008.318982] RDX: 00007ffc27e6fb90 RSI: 0000000040406469 RDI: 0000000000000003 [14008.318989] RBP: 0000000000000003 R08: 000000000000000b R09: 00007f5c3c17a248 [14008.318995] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000040406469 [14008.319003] R13: 00007ffc27e6fb90 R14: 0000000000000000 R15: 0000000000000047 [14008.319012] Code: 0f 85 a7 fb ff ff 48 c7 c6 e0 4d 17 a0 48 c7 c7 8b 40 16 a0 e8 4e e9 0c e1 0f ff 8b 83 78 01 00 00 85 c0 0f 8e 92 fb ff ff 0f 0b <0f> 0b 65 ff 05 c0 ed f7 5f 48 8b 05 d1 1a 11 00 e8 44 0a 05 e1 [14008.319098] RIP: i915_gem_request_retire+0x4a7/0x550 [i915] RSP: ffffc90001ec3b00 [14008.319121] ---[ end trace fb5365ed3bdbd8f5 ]---
This is 99.9999% likely not a driver bug.
This was with guc submission enabled, kernel params: drm.debug=0xe 3 i915.enable_guc_loading=1 i915.enable_guc_submission=1
Without guc, ~2000 runs without failures.
Changing priority to low, since is IGT non-basic Failure very sporadic (<20%). Thanks.
I am looking into this issue. Is it possible for someone to post a binary blob of the GuC log for the run when this issue is exhibited?
I am still looking into this issue; however, I have been context switching with other work. Also, I did not see a reply regarding the posting of a binary blob of the GuC log for the run when this issue is exhibited. Is that possible?
I seem to hit lot of DRM_ERROR_RATELIMITED("no sub-buffer to capture logs\n"); while logging. Could you provide the desired log level and a method of reading the log buffers effectively. Or a way to switch to overwrite mode so that we get the latest stuff on error.
So far, I have been unable to reproduce this issue. I have executed the test as specified in the initial comment: while ./gem_ctx_switch --r blt-interruptible ; do date ; done The loop has been running for over 24 hours, and I have not seen a failure. I am running on a KBL system; however, the device differs slightly: 00:02.0 8086:5917 (rev 07) I am running with the 9.14 version of the Guc FW and the latest kernel version from drm-intel-next-queued. It's unclear from this entry what kernel version exhibited the failure, but I could try a version based on the date of when this issue was submitted. I am also seeing the dmesg error: [drm:guc_read_update_log_buffer] *ERROR* no sub-buffer to capture logs I need to investigate that further.
Hello, I am still unable to reproduce this issue. I did try an older kernel version (4.12.0-rc5+), but I am not seeing any failures. Are there any configuration details that I should be aware of? Can you post any relevant configuration in formation, or maybe post the entire dmesg dump for the failing case?
I am also unable to reproduce this with latest drm-tip: commit 63e85ec6f910933a46b5a50a2a077b6860ed4815 ~500 runs without failure
If neither of us can reproduce this issue, do you have any suggestions on how we should proceed?
Hello, if not reproducible anymore, can this issue be closed? Thank you.
I think it should be closed. Unless Mika objects, I can close it.
To actually reconstruct the original bug where the guc was executing contexts out of order, you have to disable trickle feeding the guc. If you bump the queue depth up to say 128, that should restore behaviour similar to the original case.
First of all. Sorry about spam. This is mass update for our bugs. Sorry if you feel this annoying but with this trying to understand if bug still valid or not. If bug investigation still in progress, please ignore this and I apologize! If you think this is not anymore valid, please comment to the bug that can be closed. If you haven't tested with our latest pre-upstream tree(drm-tip), can you do that also to see if issue is valid there still and if you cannot see issue there, please comment to the bug.
Closing
Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.