Bug 101642

Summary:	[KBL] igt@gem_ctx_switch@ blt-interruptible GuC 9.14 triggers GEM_BUG_ON(!i915_gem_request_completed(request));
Product:	DRI	Reporter:	Mika Kuoppala <mika.kuoppala>
Component:	DRM/Intel	Assignee:	John Spotswood <john.a.spotswood>
Status:	CLOSED WORKSFORME	QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity:	normal
Priority:	low	CC:	intel-gfx-bugs
Version:	DRI git
Hardware:	Other
OS:	All
Whiteboard:	ReadyForDev
i915 platform:	KBL	i915 features:	GEM/Other

Description Mika Kuoppala 2017-06-29 08:55:11 UTC

HW: 00:02.0 0300: 8086:5916 (rev 02)

Steps to reproduce:
while ./gem_ctx_switch --r blt-interruptible ; do date ; done

~1% occurrence

Dmesg:

[13873.317628] [IGT] gem_ctx_switch: executing
[13873.325119] [IGT] gem_ctx_switch: starting subtest blt-interruptible
[14008.318651] ------------[ cut here ]------------
[14008.318653] kernel BUG at drivers/gpu/drm/i915/i915_gem_request.c:323!
[14008.318663] invalid opcode: 0000 [#1] SMP
[14008.318668] Modules linked in: snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crc32_pclmul snd_hda_intel snd_hda_codec ghash_clmulni_intel snd_hwdep snd_hda_core snd_pcm mei_me mei i915 r8169 mii prime_numbers i2c_hid
[14008.318707] CPU: 3 PID: 2066 Comm: gem_ctx_switch Tainted: G     U          4.12.0-rc7-svmtest+ #150
[14008.318715] Hardware name: Acer Aspire E5-575G/Ironman_SK  , BIOS V1.12 08/02/2016
[14008.318723] task: ffff8801ddc1b200 task.stack: ffffc90001ec0000
[14008.318754] RIP: 0010:i915_gem_request_retire+0x4a7/0x550 [i915]
[14008.318760] RSP: 0018:ffffc90001ec3b00 EFLAGS: 00010283
[14008.318765] RAX: 000000002fe501f7 RBX: ffff8801e1503600 RCX: 000000002fe501f7
[14008.318771] RDX: 000000002fe501b8 RSI: ffff8801dd97b800 RDI: ffff8801e1503600
[14008.318778] RBP: ffffc90001ec3b30 R08: 0000000000000000 R09: 0000000000000000
[14008.318784] R10: ffffc90001ec3b28 R11: ffffffffa0082016 R12: ffff8801dde88000
[14008.318791] R13: ffff8801dc240000 R14: ffff8801dde88000 R15: ffff8801d9a8e000
[14008.318797] FS:  00007f5c3c17a2c0(0000) GS:ffff8801e8400000(0000) knlGS:0000000000000000
[14008.318805] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[14008.318811] CR2: 000055fa1b9c5028 CR3: 00000001dbb93000 CR4: 00000000003406e0
[14008.318818] Call Trace:
[14008.318841]  i915_gem_request_alloc+0xbe/0x400 [i915]
[14008.318864]  i915_gem_do_execbuffer+0x52b/0x11f0 [i915]
[14008.318874]  ? lock_acquire+0xb0/0x1e0
[14008.318880]  ? lock_acquire+0xb0/0x1e0
[14008.318902]  i915_gem_execbuffer2+0xbe/0x1d0 [i915]
[14008.318909]  drm_ioctl+0x1f3/0x470
[14008.318928]  ? i915_gem_execbuffer+0x2d0/0x2d0 [i915]
[14008.318937]  ? trace_hardirqs_on+0xd/0x10
[14008.318944]  do_vfs_ioctl+0x8f/0x6d0
[14008.318951]  SyS_ioctl+0x3c/0x70
[14008.318957]  entry_SYSCALL_64_fastpath+0x1c/0xb1
[14008.318962] RIP: 0033:0x7f5c3a976587
[14008.318967] RSP: 002b:00007ffc27e6fa28 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
[14008.318976] RAX: ffffffffffffffda RBX: 00007ffc27e6fa40 RCX: 00007f5c3a976587
[14008.318982] RDX: 00007ffc27e6fb90 RSI: 0000000040406469 RDI: 0000000000000003
[14008.318989] RBP: 0000000000000003 R08: 000000000000000b R09: 00007f5c3c17a248
[14008.318995] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000040406469
[14008.319003] R13: 00007ffc27e6fb90 R14: 0000000000000000 R15: 0000000000000047
[14008.319012] Code: 0f 85 a7 fb ff ff 48 c7 c6 e0 4d 17 a0 48 c7 c7 8b 40 16 a0 e8 4e e9 0c e1 0f ff 8b 83 78 01 00 00 85 c0 0f 8e 92 fb ff ff 0f 0b <0f> 0b 65 ff 05 c0 ed f7 5f 48 8b 05 d1 1a 11 00 e8 44 0a 05 e1 
[14008.319098] RIP: i915_gem_request_retire+0x4a7/0x550 [i915] RSP: ffffc90001ec3b00
[14008.319121] ---[ end trace fb5365ed3bdbd8f5 ]---

Comment 1 Chris Wilson 2017-06-29 09:44:51 UTC

This is 99.9999% likely not a driver bug.

Comment 2 Mika Kuoppala 2017-06-29 11:31:03 UTC

This was with guc submission enabled, kernel params:

drm.debug=0xe 3 i915.enable_guc_loading=1 i915.enable_guc_submission=1

Comment 3 Mika Kuoppala 2017-07-03 09:05:14 UTC

Without guc, ~2000 runs without failures.

Comment 4 Elizabeth 2017-07-20 16:12:53 UTC

Changing priority to low, since is IGT non-basic Failure very sporadic (<20%). Thanks.

Comment 5 John Spotswood 2017-09-08 18:40:54 UTC

I am looking into this issue.  Is it possible for someone to post a binary blob of the GuC log for the run when this issue is exhibited?

Comment 6 John Spotswood 2017-10-06 00:03:31 UTC

I am still looking into this issue; however, I have been context switching with other work.  Also, I did not see a reply regarding the posting of a binary blob of the GuC log for the run when this issue is exhibited.  Is that possible?

Comment 7 Mika Kuoppala 2017-10-09 13:49:07 UTC

I seem to hit lot of

		DRM_ERROR_RATELIMITED("no sub-buffer to capture logs\n");

while logging.

Could you provide the desired log level and a method of reading
the log buffers effectively. Or a way to switch to overwrite mode so
that we get the latest stuff on error.

Comment 8 John Spotswood 2017-10-12 00:13:30 UTC

So far, I have been unable to reproduce this issue.  I have executed the test as specified in the initial comment:

   while ./gem_ctx_switch --r blt-interruptible ; do date ; done

The loop has been running for over 24 hours, and I have not seen a failure.  I am running on a KBL system; however, the device differs slightly:

   00:02.0 8086:5917 (rev 07)

I am running with the 9.14 version of the Guc FW and the latest kernel version from drm-intel-next-queued.  It's unclear from this entry what kernel version exhibited the failure, but I could try a version based on the date of when this issue was submitted.

I am also seeing the dmesg error:

   [drm:guc_read_update_log_buffer] *ERROR* no sub-buffer to capture logs

I need to investigate that further.

Comment 9 John Spotswood 2017-10-19 16:07:49 UTC

Hello, I am still unable to reproduce this issue.  I did try an older kernel version (4.12.0-rc5+), but I am not seeing any failures.

Are there any configuration details that I should be aware of?  Can you post any relevant configuration in formation, or maybe post the entire dmesg dump for the failing case?

Comment 10 Mika Kuoppala 2017-10-26 08:46:16 UTC

I am also unable to reproduce this with latest drm-tip:
commit 63e85ec6f910933a46b5a50a2a077b6860ed4815

~500 runs without failure

Comment 11 John Spotswood 2017-10-27 23:13:54 UTC

If neither of us can reproduce this issue, do you have any suggestions on how we should proceed?

Comment 12 Elizabeth 2018-03-05 19:46:56 UTC

Hello, if not reproducible anymore, can this issue be closed? Thank you.

Comment 13 John Spotswood 2018-03-06 00:08:18 UTC

I think it should be closed.  Unless Mika objects, I can close it.

Comment 14 Chris Wilson 2018-03-06 08:35:14 UTC

To actually reconstruct the original bug where the guc was executing contexts out of order, you have to disable trickle feeding the guc. If you bump the queue depth up to say 128, that should restore behaviour similar to the original case.

Comment 15 Jani Saarinen 2018-03-29 07:11:34 UTC

First of all. Sorry about spam.
This is mass update for our bugs. 

Sorry if you feel this annoying but with this trying to understand if bug still valid or not.
If bug investigation still in progress, please ignore this and I apologize!

If you think this is not anymore valid, please comment to the bug that can be closed.
If you haven't tested with our latest pre-upstream tree(drm-tip), can you do that also to see if issue is valid there still and if you cannot see issue there, please comment to the bug.

Comment 16 Francesco Balestrieri 2019-01-17 07:23:16 UTC

Closing

Use of freedesktop.org services, including Bugzilla, is subject to our Code of Conduct. How we collect and use information is described in our Privacy Policy.