Bug 108602

Summary:	[CI][BAT][intel_iommu] igt@drv_selftest@live_hangcheck - incomplete - general protection fault: 0000 [#1] PREEMPT SMP
Product:	DRI	Reporter:	Martin Peres <martin.peres>
Component:	DRM/Intel	Assignee:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Status:	RESOLVED NOTOURBUG	QA Contact:	Intel GFX Bugs mailing list <intel-gfx-bugs>
Severity:	normal
Priority:	medium	CC:	intel-gfx-bugs
Version:	XOrg git
Hardware:	Other
OS:	All
Whiteboard:	ReadyForDev
i915 platform:	SKL	i915 features:	GEM/Other

Description Martin Peres 2018-10-30 15:50:50 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5049/fi-skl-iommu/igt@drv_selftest@live_hangcheck.html

<4> [370.366445] general protection fault: 0000 [#1] PREEMPT SMP PTI
<4> [370.366449] CPU: 1 PID: 4804 Comm: drv_selftest Tainted: G     U            4.19.0-CI-CI_DRM_5049+ #1
<4> [370.366450] Hardware name: System manufacturer System Product Name/Z170I PRO GAMING, BIOS 1809 07/11/2016
<4> [370.366454] RIP: 0010:rb_prev+0x16/0x50
<4> [370.366456] Code: d0 e9 a5 fe ff ff 4c 89 49 10 c3 4c 89 41 10 c3 0f 1f 40 00 48 8b 0f 48 39 cf 74 36 48 8b 47 10 48 85 c0 75 05 eb 1a 48 89 d0 <48> 8b 50 08 48 85 d2 75 f4 f3 c3 48 3b 79 10 75 15 48 8b 09 48 89
<4> [370.366457] RSP: 0018:ffffc90000577810 EFLAGS: 00010002
<4> [370.366460] RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000100000 RCX: 6b6b6b6b6b6b6b6b
<4> [370.366461] RDX: ffff880141b14c80 RSI: 0000000000000000 RDI: ffff880141b14c80
<4> [370.366462] RBP: 0000000000000001 R08: 00000000b2c48c33 R09: 0000000000000001
<4> [370.366464] R10: ffffc90000577790 R11: 00000000000223ed R12: ffff88022b6e0358
<4> [370.366465] R13: 00000000000fffff R14: ffff8801a44dadc0 R15: ffff880141b14c80
<4> [370.366467] FS:  00007fe43b9b5980(0000) GS:ffff88022ea40000(0000) knlGS:0000000000000000
<4> [370.366468] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [370.366470] CR2: 000055e0d7a68d98 CR3: 000000012ab5e006 CR4: 00000000003606e0
<4> [370.366471] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4> [370.366472] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
<4> [370.366474] Call Trace:
<4> [370.366476]  alloc_iova+0x9a/0x140
<4> [370.366479]  alloc_iova_fast+0x51/0x270
<4> [370.366482]  intel_alloc_iova+0xa2/0xe0
<4> [370.366485]  intel_map_sg+0xac/0x1d0
<4> [370.366528]  i915_gem_gtt_prepare_pages+0x43/0xe0 [i915]
<4> [370.366560]  i915_gem_object_get_pages_internal+0x225/0x2b0 [i915]
<4> [370.366590]  ____i915_gem_object_get_pages+0x1d/0xa0 [i915]
<4> [370.366620]  i915_gem_object_pin_map+0x1cf/0x2a0 [i915]
<4> [370.366653]  hang_create_request+0x59/0x920 [i915]
<4> [370.366685]  igt_reset_queue+0x10c/0x5b0 [i915]
<4> [370.366722]  __i915_subtests+0x5e/0xf0 [i915]
<4> [370.366755]  intel_hangcheck_live_selftests+0x5b/0xa0 [i915]
<4> [370.366789]  __run_selftests+0x10b/0x190 [i915]
<4> [370.366823]  i915_live_selftests+0x2c/0x60 [i915]
<4> [370.366850]  i915_pci_probe+0x50/0xa0 [i915]
<4> [370.366853]  pci_device_probe+0xa1/0x130
<4> [370.366857]  really_probe+0x25d/0x3c0
<4> [370.366859]  driver_probe_device+0x10a/0x120
<4> [370.366862]  __driver_attach+0xdb/0x100
<4> [370.366864]  ? driver_probe_device+0x120/0x120
<4> [370.366866]  bus_for_each_dev+0x74/0xc0
<4> [370.366868]  bus_add_driver+0x15f/0x250
<4> [370.366870]  ? 0xffffffffa079e000
<4> [370.366872]  driver_register+0x56/0xe0
<4> [370.366874]  ? 0xffffffffa079e000
<4> [370.366876]  do_one_initcall+0x58/0x2e0
<4> [370.366879]  ? rcu_lockdep_current_cpu_online+0x8f/0xd0
<4> [370.366881]  ? do_init_module+0x1d/0x1ea
<4> [370.366883]  ? rcu_read_lock_sched_held+0x6f/0x80
<4> [370.366886]  ? kmem_cache_alloc_trace+0x264/0x290
<4> [370.366888]  do_init_module+0x56/0x1ea
<4> [370.366891]  load_module+0x26f5/0x29d0
<4> [370.366895]  ? vfs_read+0x122/0x140
<4> [370.366899]  ? __se_sys_finit_module+0xd3/0xf0
<4> [370.366901]  __se_sys_finit_module+0xd3/0xf0
<4> [370.366905]  do_syscall_64+0x55/0x190
<4> [370.366908]  entry_SYSCALL_64_after_hwframe+0x49/0xbe
<4> [370.366909] RIP: 0033:0x7fe43b27d839
<4> [370.366911] Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1f f6 2c 00 f7 d8 64 89 01 48
<4> [370.366913] RSP: 002b:00007ffff1541988 EFLAGS: 00000246 ORIG_RAX: 0000000000000139
<4> [370.366915] RAX: ffffffffffffffda RBX: 0000558774c1da50 RCX: 00007fe43b27d839
<4> [370.366916] RDX: 0000000000000000 RSI: 0000558774c1e890 RDI: 0000000000000006
<4> [370.366918] RBP: 0000558774c1e890 R08: 0000000000000004 R09: 0000000000000000
<4> [370.366919] R10: 00007ffff1541b00 R11: 0000000000000246 R12: 0000000000000000
<4> [370.366920] R13: 0000558774c18470 R14: 0000000000000020 R15: 000000000000003d
<4> [370.366924] Modules linked in: i915(+) amdgpu chash gpu_sched ttm vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic btusb btrtl btbcm btintel x86_pkg_temp_thermal bluetooth coretemp crct10dif_pclmul crc32_pclmul ghash_clmulni_intel ecdh_generic snd_hda_codec snd_hwdep snd_hda_core e1000e mei_me snd_pcm mei prime_numbers [last unloaded: i915]

Comment 1 Chris Wilson 2018-10-30 16:50:26 UTC

iova/intel_iommu strikes at last. Use-after-free in drivers/iommu/iova.c

NOTOURBUG, but it would be nice to see this in kasan; however, it's quite rare (as we only have the one iommu setup).
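
The use-after-free reading is consistent with the register dump in the description: RAX and RCX hold 0x6b6b6b6b6b6b6b6b, and 0x6b is the byte (POISON_FREE) that the slab allocator writes over freed objects when poisoning is enabled. The faulting instruction appears to load from RAX+8, and a non-canonical address of this kind raises a general protection fault rather than a page fault. Below is a minimal Python sketch for spotting such values in an oops; the helper name and the regex are illustrative assumptions, not part of any kernel tooling.

```python
import re

# Byte written over freed slab objects when slab poisoning is enabled
# (POISON_FREE in include/linux/poison.h).
POISON_FREE = 0x6B

# Matches "RAX: 6b6b6b6b6b6b6b6b" style pairs; the pattern is an assumption
# based on the register lines quoted in this report.
REG_RE = re.compile(r"\b([A-Z][A-Z0-9]{1,2}): ([0-9a-f]{16})\b")


def poisoned_registers(oops_text):
    """Return the names of registers whose eight bytes all equal POISON_FREE."""
    hits = []
    for name, value in REG_RE.findall(oops_text):
        if bytes.fromhex(value) == bytes([POISON_FREE]) * 8:
            hits.append(name)
    return hits


sample = """\
<4> [370.366460] RAX: 6b6b6b6b6b6b6b6b RBX: 0000000000100000 RCX: 6b6b6b6b6b6b6b6b
<4> [370.366461] RDX: ffff880141b14c80 RSI: 0000000000000000 RDI: ffff880141b14c80
"""

print(poisoned_registers(sample))  # ['RAX', 'RCX']
```

A hit only suggests that a pointer was read out of freed memory; it does not identify who freed the object, which is why a KASAN report is still the more useful evidence.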

Comment 2 Francesco Balestrieri 2018-11-02 13:58:49 UTC

Can I resolve as NOTOURBUG or do we keep it open? What about the priority?

Comment 3 Chris Wilson 2018-11-02 14:02:53 UTC

Ideally we'd hand off to Joerg Roedel, but if it's only been seen on our one iommu machine we'd probably need to gather some more information to pinpoint the use-after-free (a kasan hit).

Comment 4 Martin Peres 2018-11-02 14:17:23 UTC

(In reply to Chris Wilson from comment #3)
> Ideally we'd hand off to Joerg Roedel, but if it's only been seen on our one
> iommu machine we'd probably need to gather some more information to pinpoint
> the use-after-free (a kasan hit).

Yeah, even if it is not our bug, we still can't live with the failure because it reduces our CI coverage. We thus need to be good citizens and report the bug to the relevant parties.

Comment 5 Francesco Balestrieri 2018-11-08 11:50:24 UTC

Abdiel, can you take a look?

Comment 6 Francesco Balestrieri 2018-11-13 13:11:50 UTC

What's the difference between this system and all the others? I thought IOMMU was enabled by default in all of them.

Comment 7 Chris Wilson 2018-11-13 13:15:40 UTC

This is the only machine we have iommu enabled for. It simply has not been reliable enough on many gens (4-8, parts of 9), so we default to off until proven otherwise.

Comment 8 Abdiel Janulgue 2018-11-19 08:45:56 UTC

I couldn't reproduce this on a SKL machine with last week's drm-tip. Enabled IOMMU in both BIOS (VT-d) and kernel.

Comment 9 Abdiel Janulgue 2018-11-19 08:49:31 UTC

Tried this one on IOMMU-enabled kbl as well, tested with kasan both on and off. Couldn't coax the fault out either with drv_selftest.

Comment 10 Francesco Balestrieri 2018-11-19 09:36:29 UTC

Odd. Up to two days ago this was happening multiple times per day on the CI machine. Also, it started appearing two weeks ago; it would be interesting to know which kernel version introduced it.

Comment 11 Francesco Balestrieri 2018-11-20 08:47:34 UTC

Given how difficult this is to reproduce I'm moving it to "high".

Comment 12 CI Bug Log 2019-05-24 12:28:37 UTC

A CI Bug Log filter associated to this bug has been updated:

{- IOMMU: igt@drv_selftest@live_hangcheck - incomplete - general protection fault: 0000 [#1] PREEMPT SMP -}
{+ IOMMU: igt@drv_selftest@live_(hangcheck|reset) - incomplete - general protection fault: 0000 [#1] PREEMPT SMP +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6138/fi-skl-iommu/igt@i915_selftest@live_reset.html

Comment 13 CI Bug Log 2019-05-24 12:29:33 UTC

A CI Bug Log filter associated to this bug has been updated:

{- IOMMU: igt@runner@aborted - fail - Previous test: i915_selftest (live_hangcheck) -}
{+ IOMMU: igt@runner@aborted - fail - Previous test: i915_selftest (live_hangcheck / live_reset) +}


  No new failures caught with the new filter

Comment 14 CI Bug Log 2019-06-03 11:29:53 UTC

A CI Bug Log filter associated to this bug has been updated:

{- IOMMU: igt@drv_selftest@live_(hangcheck|reset) - incomplete - general protection fault: 0000 [#1] PREEMPT SMP -}
{+ IOMMU: igt@i195_selftest@live_(hangcheck|reset|blt) - incomplete - general protection fault: 0000 [#1] PREEMPT SMP +}

New failures caught by the filter:

  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6171/fi-skl-iommu/igt@i915_selftest@live_blt.html
  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6172/fi-skl-iommu/igt@i915_selftest@live_blt.html
  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6173/fi-skl-iommu/igt@i915_selftest@live_blt.html
  * https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6179/fi-skl-iommu/igt@i915_selftest@live_blt.html
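
The result URLs listed in this report share a common layout (branch, build ID, machine, test page), which makes it easy to tabulate which builds hit the filter. The following Python sketch assumes that layout, inferred only from the URLs above; the function name is hypothetical and this is not how CI Bug Log itself processes results.

```python
from urllib.parse import urlparse


def parse_result_url(url):
    """Split a CI result URL into branch, build, machine, test binary and subtest.

    Assumes the path looks like /tree/<branch>/<build>/<machine>/igt@<binary>@<subtest>.html
    """
    parts = urlparse(url).path.strip("/").split("/")
    _tree, branch, build, machine, page = parts
    test = page[:-len(".html")] if page.endswith(".html") else page
    _igt, binary, subtest = test.split("@")
    return {
        "branch": branch,
        "build": build,
        "machine": machine,
        "binary": binary,
        "subtest": subtest,
    }


urls = [
    "https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6171/fi-skl-iommu/igt@i915_selftest@live_blt.html",
    "https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6172/fi-skl-iommu/igt@i915_selftest@live_blt.html",
    "https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6173/fi-skl-iommu/igt@i915_selftest@live_blt.html",
    "https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_6179/fi-skl-iommu/igt@i915_selftest@live_blt.html",
]

results = [parse_result_url(u) for u in urls]
print([r["build"] for r in results])
print({r["machine"] for r in results})
```

All four failures land on the same machine, fi-skl-iommu, which matches the observation in comment 7 that only one CI machine has the IOMMU enabled.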

Comment 15 Chris Wilson 2019-07-20 16:57:59 UTC

[   48.477731] ==================================================================
[   48.477773] BUG: KASAN: use-after-free in __cached_rbnode_delete_update+0x68/0x110
[   48.477812] Read of size 8 at addr ffff88870fc19020 by task kworker/u8:1/37
[   48.477843] 
[   48.477879] CPU: 1 PID: 37 Comm: kworker/u8:1 Tainted: G     U            5.2.0+ #735
[   48.477915] Hardware name: Intel Corporation NUC7i5BNK/NUC7i5BNB, BIOS BNKBL357.86A.0052.2017.0918.1346 09/18/2017
[   48.478047] Workqueue: i915 __i915_gem_free_work [i915]
[   48.478075] Call Trace:
[   48.478111]  dump_stack+0x5b/0x90
[   48.478137]  print_address_description+0x67/0x237
[   48.478178]  ? __cached_rbnode_delete_update+0x68/0x110
[   48.478212]  __kasan_report.cold.3+0x1c/0x38
[   48.478240]  ? __cached_rbnode_delete_update+0x68/0x110
[   48.478280]  ? __cached_rbnode_delete_update+0x68/0x110
[   48.478308]  __cached_rbnode_delete_update+0x68/0x110
[   48.478344]  private_free_iova+0x2b/0x60
[   48.478378]  iova_magazine_free_pfns+0x46/0xa0
[   48.478403]  free_iova_fast+0x277/0x340
[   48.478443]  fq_ring_free+0x15a/0x1a0
[   48.478473]  queue_iova+0x19c/0x1f0
[   48.478597]  cleanup_page_dma.isra.64+0x62/0xb0 [i915]
[   48.478712]  __gen8_ppgtt_cleanup+0x63/0x80 [i915]
[   48.478826]  __gen8_ppgtt_cleanup+0x42/0x80 [i915]
[   48.478940]  __gen8_ppgtt_clear+0x433/0x4b0 [i915]
[   48.479053]  __gen8_ppgtt_clear+0x462/0x4b0 [i915]
[   48.479081]  ? __sg_free_table+0x9e/0xf0
[   48.479116]  ? kfree+0x7f/0x150
[   48.479234]  i915_vma_unbind+0x1e2/0x240 [i915]
[   48.479352]  i915_vma_destroy+0x3a/0x280 [i915]
[   48.479465]  __i915_gem_free_objects+0xf0/0x2d0 [i915]
[   48.479579]  __i915_gem_free_work+0x41/0xa0 [i915]
[   48.479607]  process_one_work+0x495/0x710
[   48.479642]  worker_thread+0x4c7/0x6f0
[   48.479687]  ? process_one_work+0x710/0x710
[   48.479724]  kthread+0x1b2/0x1d0
[   48.479774]  ? kthread_create_worker_on_cpu+0xa0/0xa0
[   48.479820]  ret_from_fork+0x1f/0x30
[   48.479864] 
[   48.479907] Allocated by task 631:
[   48.479944]  save_stack+0x19/0x80
[   48.479994]  __kasan_kmalloc.constprop.6+0xc1/0xd0
[   48.480038]  kmem_cache_alloc+0x91/0xf0
[   48.480082]  alloc_iova+0x2b/0x1e0
[   48.480125]  alloc_iova_fast+0x58/0x376
[   48.480166]  intel_alloc_iova+0x90/0xc0
[   48.480214]  intel_map_sg+0xde/0x1f0
[   48.480343]  i915_gem_gtt_prepare_pages+0xb8/0x170 [i915]
[   48.480465]  huge_get_pages+0x232/0x2b0 [i915]
[   48.480590]  ____i915_gem_object_get_pages+0x40/0xb0 [i915]
[   48.480712]  __i915_gem_object_get_pages+0x90/0xa0 [i915]
[   48.480834]  i915_gem_object_prepare_write+0x2d6/0x330 [i915]
[   48.480955]  create_test_object.isra.54+0x1a9/0x3e0 [i915]
[   48.481075]  igt_shared_ctx_exec+0x365/0x3c0 [i915]
[   48.481210]  __i915_subtests.cold.4+0x30/0x92 [i915]
[   48.481341]  __run_selftests.cold.3+0xa9/0x119 [i915]
[   48.481466]  i915_live_selftests+0x3c/0x70 [i915]
[   48.481583]  i915_pci_probe+0xe7/0x220 [i915]
[   48.481620]  pci_device_probe+0xe0/0x180
[   48.481665]  really_probe+0x163/0x4e0
[   48.481710]  device_driver_attach+0x85/0x90
[   48.481750]  __driver_attach+0xa5/0x180
[   48.481796]  bus_for_each_dev+0xda/0x130
[   48.481831]  bus_add_driver+0x205/0x2e0
[   48.481882]  driver_register+0xca/0x140
[   48.481927]  do_one_initcall+0x6c/0x1af
[   48.481970]  do_init_module+0x106/0x350
[   48.482010]  load_module+0x3d2c/0x3ea0
[   48.482058]  __do_sys_finit_module+0x110/0x180
[   48.482102]  do_syscall_64+0x62/0x1f0
[   48.482147]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[   48.482190] 
[   48.482224] Freed by task 37:
[   48.482273]  save_stack+0x19/0x80
[   48.482318]  __kasan_slab_free+0x12e/0x180
[   48.482363]  kmem_cache_free+0x70/0x140
[   48.482406]  __free_iova+0x1d/0x30
[   48.482445]  fq_ring_free+0x15a/0x1a0
[   48.482490]  queue_iova+0x19c/0x1f0
[   48.482624]  cleanup_page_dma.isra.64+0x62/0xb0 [i915]
[   48.482749]  __gen8_ppgtt_cleanup+0x63/0x80 [i915]
[   48.482873]  __gen8_ppgtt_cleanup+0x42/0x80 [i915]
[   48.482999]  __gen8_ppgtt_clear+0x433/0x4b0 [i915]
[   48.483123]  __gen8_ppgtt_clear+0x462/0x4b0 [i915]
[   48.483250]  i915_vma_unbind+0x1e2/0x240 [i915]
[   48.483378]  i915_vma_destroy+0x3a/0x280 [i915]
[   48.483500]  __i915_gem_free_objects+0xf0/0x2d0 [i915]
[   48.483622]  __i915_gem_free_work+0x41/0xa0 [i915]
[   48.483659]  process_one_work+0x495/0x710
[   48.483704]  worker_thread+0x4c7/0x6f0
[   48.483748]  kthread+0x1b2/0x1d0
[   48.483787]  ret_from_fork+0x1f/0x30
[   48.483831] 
[   48.483868] The buggy address belongs to the object at ffff88870fc19000
[   48.483868]  which belongs to the cache iommu_iova of size 40
[   48.483920] The buggy address is located 32 bytes inside of
[   48.483920]  40-byte region [ffff88870fc19000, ffff88870fc19028)
[   48.483964] The buggy address belongs to the page:
[   48.484006] page:ffffea001c3f0600 refcount:1 mapcount:0 mapping:ffff8888181a91c0 index:0x0 compound_mapcount: 0
[   48.484045] flags: 0x8000000000010200(slab|head)
[   48.484096] raw: 8000000000010200 ffffea001c421a08 ffffea001c447e88 ffff8888181a91c0
[   48.484141] raw: 0000000000000000 0000000000120012 00000001ffffffff 0000000000000000
[   48.484188] page dumped because: kasan: bad access detected
[   48.484230] 
[   48.484265] Memory state around the buggy address:
[   48.484314]  ffff88870fc18f00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   48.484361]  ffff88870fc18f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   48.484406] >ffff88870fc19000: fb fb fb fb fb fc fc fc fc fc fc fc fc fc fc fc
[   48.484451]                                ^
[   48.484494]  ffff88870fc19080: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   48.484530]  ffff88870fc19100: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
[   48.484579] ==================================================================
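
The address arithmetic in the KASAN report can be checked by hand: the buggy address ffff88870fc19020 minus the object base ffff88870fc19000 is 0x20, i.e. 32 bytes into the 40-byte iommu_iova object. The Python sketch below maps that offset onto a field, assuming struct iova is laid out as an rb_node (three 8-byte words) followed by pfn_hi and pfn_lo on a 64-bit kernel. That layout is an assumption about the kernel headers of that era, not something stated in the log.

```python
import ctypes


class RbNode(ctypes.Structure):
    # Assumed 64-bit layout of struct rb_node: three pointer-sized words.
    _fields_ = [
        ("rb_parent_color", ctypes.c_uint64),
        ("rb_right", ctypes.c_uint64),
        ("rb_left", ctypes.c_uint64),
    ]


class Iova(ctypes.Structure):
    # Assumed layout of struct iova; field order is taken on trust.
    _fields_ = [
        ("node", RbNode),
        ("pfn_hi", ctypes.c_uint64),
        ("pfn_lo", ctypes.c_uint64),
    ]


OBJECT_BASE = 0xFFFF88870FC19000  # "belongs to the object at" from the report
BUGGY_ADDR = 0xFFFF88870FC19020   # "Read of size 8 at addr" from the report


def field_at(struct, offset):
    """Return the name of the top-level field covering the given byte offset."""
    for name, _ctype in struct._fields_:
        desc = getattr(struct, name)
        if desc.offset <= offset < desc.offset + desc.size:
            return name
    return None


offset = BUGGY_ADDR - OBJECT_BASE
print(ctypes.sizeof(Iova), offset, field_at(Iova, offset))
```

Under this assumed layout the 8-byte read lands on pfn_lo of an already freed iova, which fits a comparison against the cached node inside __cached_rbnode_delete_update.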

Comment 16 Chris Wilson 2019-07-20 18:25:08 UTC

commit b7d9c279b098102f2e85c942f974fcc613219804 (drm-intel/topic/core-for-CI, topic/core-for-CI)
Author: Chris Wilson <chris@chris-wilson.co.uk>
Date:   Sat Jul 20 19:08:48 2019 +0100

    iommu/iova: Remove stale cached32_node
    
    Since the cached32_node is allowed to be advanced above dma_32bit_pfn
    (to provide a shortcut into the limited range), we need to be careful to
    remove the to be freed node if it is the cached32_node.
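
The failure mode described by the commit message can be illustrated with a toy model. This is a Python sketch, not the kernel code: a sorted list stands in for the rbtree, all names are hypothetical, and the pre-fix condition (only update the cache when the freed range lies below the 32-bit limit) is an assumption modelled on the commit message.

```python
DMA_32BIT_PFN = 1 << 20  # hypothetical limit: first PFN above 4 GiB with 4 KiB pages


class Iova:
    def __init__(self, pfn_lo, pfn_hi):
        self.pfn_lo = pfn_lo
        self.pfn_hi = pfn_hi
        self.freed = False


class Domain:
    def __init__(self, check_identity):
        self.nodes = []        # kept sorted by pfn_lo; stands in for the rbtree
        self.cached32 = None   # stands in for cached32_node
        self.check_identity = check_identity

    def insert(self, iova):
        self.nodes.append(iova)
        self.nodes.sort(key=lambda n: n.pfn_lo)
        return iova

    def free(self, iova):
        cached = self.cached32
        below_limit = iova.pfn_hi < DMA_32BIT_PFN
        # The extra identity check models the fix: drop the cache entry
        # whenever it is the node being freed, wherever that node sits.
        is_cached = self.check_identity and iova is cached
        if is_cached or (cached is not None and below_limit
                         and iova.pfn_lo >= cached.pfn_lo):
            idx = self.nodes.index(iova)
            has_next = idx + 1 < len(self.nodes)
            self.cached32 = self.nodes[idx + 1] if has_next else None
        self.nodes.remove(iova)
        iova.freed = True


def cache_is_stale(check_identity):
    """Free a cached node that sits above the limit; report a dangling cache."""
    dom = Domain(check_identity)
    dom.insert(Iova(0x100, 0x1FF))
    high = dom.insert(Iova(DMA_32BIT_PFN + 0x10, DMA_32BIT_PFN + 0x1F))
    dom.cached32 = high  # cache advanced above the limit, as the commit allows
    dom.free(high)
    return dom.cached32 is not None and dom.cached32.freed


print(cache_is_stale(check_identity=False))  # True: cache still points at freed node
print(cache_is_stale(check_identity=True))   # False: cache was moved on free
```

In the model without the identity check, the next allocation or free that consults the cache would dereference a freed node, which is the pattern both the original oops and the KASAN report point to.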

Comment 17 CI Bug Log 2019-08-22 07:20:41 UTC

The CI Bug Log issue associated to this bug has been archived.

New failures matching the above filters will not be associated to this bug anymore.
