Bug 108602 - [CI][BAT] igt@drv_selftest@live_hangcheck - incomplete - general protection fault: 0000 [#1] PREEMPT SMP
Status: NEW
Alias: None
Product: DRI
Classification: Unclassified
Component: DRM/Intel
Version: XOrg git
Hardware: Other All
Importance: high normal
Assignee: Abdiel Janulgue
QA Contact: Intel GFX Bugs mailing list
URL:
Whiteboard: ReadyForDev
Keywords:
Depends on:
Blocks:
 
Reported: 2018-10-30 15:50 UTC by Martin Peres
Modified: 2018-11-20 08:47 UTC

See Also:
i915 platform: SKL
i915 features: GEM/Other


Description Martin Peres 2018-10-30 15:50:50 UTC
https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5049/fi-skl-iommu/igt@drv_selftest@live_hangcheck.html

<4> [370.366445] general protection fault: 0000 [#1] PREEMPT SMP PTI
<4> [370.366449] CPU: 1 PID: 4804 Comm: drv_selftest Tainted: G     U            4.19.0-CI-CI_DRM_5049+ #1
<4> [370.366450] Hardware name: System manufacturer System Product Name/Z170I PRO GAMING, BIOS 1809 07/11/2016
<4> [370.366454] RIP: 0010:rb_prev+0x16/0x50
<4> [370.366456] Code: d0 e9 a5 fe ff ff 4c 89 49 10 c3 4c 89 41 10 c3 0f 1f 40 00 48 8b 0f 48 39 cf 74 36 48 8b 47 10 48 85 c0 75 05 eb 1a 48 89 d0 <48> 8b 50 08 48 85 d2 75 f4 f3 c3 48 3b 79 10 75 15 48 8b 09 48 89
<4> [370.366457] RSP: 0018:ffffc90000577810 EFLAGS: 00010002
<4> [370.366460] RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000100000 RCX: 6b6b6b6b6b6b6b6b
<4> [370.366461] RDX: ffff880141b14c80 RSI: 0000000000000000 RDI: ffff880141b14c80
<4> [370.366462] RBP: 0000000000000001 R08: 00000000b2c48c33 R09: 0000000000000001
<4> [370.366464] R10: ffffc90000577790 R11: 00000000000223ed R12: ffff88022b6e0358
<4> [370.366465] R13: 00000000000fffff R14: ffff8801a44dadc0 R15: ffff880141b14c80
<4> [370.366467] FS:  00007fe43b9b5980(0000) GS:ffff88022ea40000(0000) knlGS:0000000000000000
<4> [370.366468] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [370.366470] CR2: 000055e0d7a68d98 CR3: 000000012ab5e006 CR4: 00000000003606e0
<4> [370.366471] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4> [370.366472] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4> [370.366474] Call Trace:
<4> [370.366476]  alloc_iova+0x9a/0x140
<4> [370.366479]  alloc_iova_fast+0x51/0x270
<4> [370.366482]  intel_alloc_iova+0xa2/0xe0
<4> [370.366485]  intel_map_sg+0xac/0x1d0
<4> [370.366528]  i915_gem_gtt_prepare_pages+0x43/0xe0 [i915]
<4> [370.366560]  i915_gem_object_get_pages_internal+0x225/0x2b0 [i915]
<4> [370.366590]  ____i915_gem_object_get_pages+0x1d/0xa0 [i915]
<4> [370.366620]  i915_gem_object_pin_map+0x1cf/0x2a0 [i915]
<4> [370.366653]  hang_create_request+0x59/0x920 [i915]
<4> [370.366685]  igt_reset_queue+0x10c/0x5b0 [i915]
<4> [370.366722]  __i915_subtests+0x5e/0xf0 [i915]
<4> [370.366755]  intel_hangcheck_live_selftests+0x5b/0xa0 [i915]
<4> [370.366789]  __run_selftests+0x10b/0x190 [i915]
<4> [370.366823]  i915_live_selftests+0x2c/0x60 [i915]
<4> [370.366850]  i915_pci_probe+0x50/0xa0 [i915]
<4> [370.366853]  pci_device_probe+0xa1/0x130
<4> [370.366857]  really_probe+0x25d/0x3c0
<4> [370.366859]  driver_probe_device+0x10a/0x120
<4> [370.366862]  __driver_attach+0xdb/0x100
<4> [370.366864]  ? driver_probe_device+0x120/0x120
<4> [370.366866]  bus_for_each_dev+0x74/0xc0
<4> [370.366868]  bus_add_driver+0x15f/0x250
<4> [370.366870]  ? 0xffffffffa079e000
<4> [370.366872]  driver_register+0x56/0xe0
<4> [370.366874]  ? 0xffffffffa079e000
<4> [370.366876]  do_one_initcall+0x58/0x2e0
<4> [370.366879]  ? rcu_lockdep_current_cpu_online+0x8f/0xd0
<4> [370.366881]  ? do_init_module+0x1d/0x1ea
<4> [370.366883]  ? rcu_read_lock_sched_held+0x6f/0x80
<4> [370.366886]  ? kmem_cache_alloc_trace+0x264/0x290
<4> [370.366888]  do_init_module+0x56/0x1ea
<4> [370.366891]  load_module+0x26f5/0x29d0
<4> [370.366895]  ? vfs_read+0x122/0x140
<4> [370.366899]  ? __se_sys_finit_module+0xd3/0xf0
<4> [370.366901]  __se_sys_finit_module+0xd3/0xf0
<4> [370.366905]  do_syscall_64+0x55/0x190
<4> [370.366908]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [370.366909] RIP: 0033:0x7fe43b27d839
<4> [370.366911] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1f f6 2c 00 f7 d8 64 89 01 48
<4> [370.366913] RSP: 002b:00007ffff1541988 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
<4> [370.366915] RAX: ffffffffffffffda RBX: 0000558774c1da50 RCX: 00007fe43b27d839
<4> [370.366916] RDX: 0000000000000000 RSI: 0000558774c1e890 RDI: 0000000000000006
<4> [370.366918] RBP: 0000558774c1e890 R08: 0000000000000004 R09: 0000000000000000
<4> [370.366919] R10: 00007ffff1541b00 R11: 0000000000000246 R12: 0000000000000000
<4> [370.366920] R13: 0000558774c18470 R14: 0000000000000020 R15: 000000000000003d
<4> [370.366924] Modules linked in: i915(+) amdgpu chash gpu_sched ttm vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic btusb btrtl btbcm btintel x86_pkg_temp_thermal bluetooth coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ecdh_generic snd_hda_codec snd_hwdep snd_hda_core e1000e mei_me snd_pcm mei prime_numbers [last unloaded: i915]
Comment 1 Chris Wilson 2018-10-30 16:50:26 UTC
iova/intel_iommu strikes at least. Use-after-free in drivers/iommu/iova.c

NOTOURBUG. It would be nice to catch this under kasan, but it's quite rare (as we only have the one iommu setup).
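
For reference, the use-after-free reading is consistent with the register dump in the description: RAX and RCX both hold 6b6b6b6b6b6b6b6b, which is the slab free-poison byte (POISON_FREE, 0x6b, from include/linux/poison.h) repeated across the whole register. A minimal sketch of a helper that flags such registers in an oops log follows; the function name and regular expression are illustrative assumptions, not part of any CI tooling.

```python
import re

# Slab free-poison byte (POISON_FREE, 0x6b) repeated over a 64-bit value.
POISON_FREE_QWORD = "6b" * 8

# Matches "RAX: 6b6b6b6b6b6b6b6b"-style pairs in an x86-64 oops register dump.
# Hypothetical pattern for illustration; it only accepts full 16-digit values.
REG_RE = re.compile(r"\b([A-Z][A-Z0-9]{1,2}):\s*([0-9a-f]{16})\b")


def poisoned_registers(oops_text):
    """Return the names of registers whose value is entirely free-poison bytes."""
    return [
        name
        for name, value in REG_RE.findall(oops_text)
        if value == POISON_FREE_QWORD
    ]
```

Running it over the two register lines that include RAX and RCX reports exactly those two registers. A hit like this only suggests that a freed slab object was dereferenced; it does not identify who freed it, which is why a kasan report would still be needed.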
Comment 2 Francesco Balestrieri 2018-11-02 13:58:49 UTC
Can I resolve as NOTOURBUG or do we keep it open? What about the priority?
Comment 3 Chris Wilson 2018-11-02 14:02:53 UTC
Ideally we'd hand off to Joerg Roedel, but if it's only been seen on our one iommu machine we'd probably need to gather some more information to pinpoint the use-after-free (a kasan hit).
Comment 4 Martin Peres 2018-11-02 14:17:23 UTC
(In reply to Chris Wilson from comment #3)
> Ideally we'd hand off to Joerg Roedel, but if it's only been seen on our one
> iommu machine we'd probably need to gather some more information to pinpoint
> the use-after-free (a kasan hit).

Yeah, even if it is not our bug, we still can't live with the failure because it reduces our CI coverage. We thus need to be good citizens and report the bug to relevant parties.
Comment 5 Francesco Balestrieri 2018-11-08 11:50:24 UTC
Abdiel, can you take a look?
Comment 6 Francesco Balestrieri 2018-11-13 13:11:50 UTC
What's the difference between this system and all the others? I thought IOMMU was enabled by default in all of them.
Comment 7 Chris Wilson 2018-11-13 13:15:40 UTC
This is the only machine we have iommu enabled for. It simply has not been reliable enough on many gens (4-8, parts of 9), so we default to off until proven otherwise.
Comment 8 Abdiel Janulgue 2018-11-19 08:45:56 UTC
I couldn't reproduce this on a SKL machine with last week's drm-tip. Enabled IOMMU in both BIOS (VT-d) and kernel.
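
When comparing a reproduction machine against the CI host, it can help to confirm what the kernel was actually booted with. The sketch below assumes the IOMMU was enabled through the `intel_iommu=` kernel parameter and simply reports its explicit value; the function name is hypothetical. A result of `None` only means the parameter was not given, since the built-in default (CONFIG_INTEL_IOMMU_DEFAULT_ON) may still enable it.

```python
def intel_iommu_setting(cmdline):
    """Return the last explicit intel_iommu= value on a kernel command line, or None."""
    setting = None
    for token in cmdline.split():
        if token.startswith("intel_iommu="):
            # Later occurrences override earlier ones in this sketch.
            setting = token.split("=", 1)[1]
    return setting


if __name__ == "__main__":
    # /proc/cmdline holds the command line of the running kernel.
    with open("/proc/cmdline") as f:
        print(intel_iommu_setting(f.read()))
```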
Comment 9 Abdiel Janulgue 2018-11-19 08:49:31 UTC
Tried this one on an IOMMU-enabled KBL as well, tested with kasan both on and off. Couldn't coax the fault out with drv_selftest there either.
Comment 10 Francesco Balestrieri 2018-11-19 09:36:29 UTC
Odd. Up to two days ago this was happening multiple times per day on the CI machine. Also, it started appearing two weeks ago; it would be interesting to know which kernel version introduced it.
Comment 11 Francesco Balestrieri 2018-11-20 08:47:34 UTC
Given how difficult this is to reproduce I'm moving it to "high".