110275 – [CI][SHARDS] igt@kms_flip@flip-vs-fences-interruptible - dmesg-warn- BUG: unable to handle kernel paging request

Summary: [CI][SHARDS] igt@kms_flip@flip-vs-fences-interruptible - dmesg-warn- BUG: una...

Status:	RESOLVED WORKSFORME

Alias:	None

Product:	DRI
Classification:	Unclassified
Component:	DRM/Intel
Version:	DRI git
Hardware:	Other All

Importance:	high normal
Assignee:	Andi
QA Contact:	Intel GFX Bugs mailing list

URL:
Whiteboard:	Triaged, ReadyForDev
Keywords:

Duplicates (1):	110340
Depends on:
Blocks:

Reported:	2019-03-28 15:04 UTC by Lakshmi
Modified:	2019-05-25 08:40 UTC
CC List:	2 users

See Also:
i915 platform:	ICL
i915 features:	GEM/Other

Description Lakshmi 2019-03-28 15:04:07 UTC

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5824/shard-iclb2/igt@kms_flip@flip-vs-fences-interruptible.html

<1> [1060.203377] BUG: unable to handle kernel paging request at ffffea0003ff8030
<1> [1060.203381] #PF error: [normal kernel read fault]
<6> [1060.203383] PGD 4a02f7067 P4D 4a02f7067 PUD 4a02f6067 PMD 0 
<4> [1060.203388] Oops: 0000 [#1] PREEMPT SMP NOPTI
<4> [1060.203391] CPU: 6 PID: 57 Comm: khugepaged Tainted: G     U            5.1.0-rc2-CI-CI_DRM_5824+ #1
<4> [1060.203393] Hardware name: Intel Corporation Ice Lake Client Platform/IceLake U DDR4 SODIMM PD RVP, BIOS ICLSFWR1.R00.3087.A00.1902250334 02/25/2019
<4> [1060.203398] RIP: 0010:compaction_alloc+0x623/0x940
<4> [1060.203401] Code: ff 48 c1 e5 06 48 01 c5 e9 e8 00 00 00 48 8b 04 24 49 89 ed 80 b8 7d 04 00 00 00 0f 84 08 01 00 00 4d 85 ed 0f 84 a7 00 00 00 <41> 8b 45 30 25 80 00 00 f0 3d 00 00 00 f0 0f 84 02 01 00 00 41 80
<4> [1060.203403] RSP: 0018:ffffc900002ab938 EFLAGS: 00010286
<4> [1060.203405] RAX: ffffffff8230b1c0 RBX: 8000000000100000 RCX: 000000000000003d
<4> [1060.203407] RDX: 80000000000ffe00 RSI: 0000000000000000 RDI: ffff8884b02f8120
<4> [1060.203409] RBP: ffffea0003ff8000 R08: 0000000000000000 R09: 0000000000000001
<4> [1060.203411] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000001
<4> [1060.203413] R13: ffffea0003ff8000 R14: ffffc900002abb30 R15: 80000000000ffe00
<4> [1060.203415] FS:  0000000000000000(0000) GS:ffff88849ff80000(0000) knlGS:0000000000000000
<4> [1060.203417] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
<4> [1060.203419] CR2: ffffea0003ff8030 CR3: 000000048f5b2001 CR4: 0000000000760ee0
<4> [1060.203421] PKRU: 55555554
<4> [1060.203422] Call Trace:
<4> [1060.203429]  migrate_pages+0x122/0xb60
<4> [1060.203432]  ? isolate_freepages_block+0x460/0x460
<4> [1060.203435]  ? __reset_isolation_suitable+0x110/0x110
<4> [1060.203439]  compact_zone+0x604/0xf50
<4> [1060.203444]  compact_zone_order+0xda/0x120
<4> [1060.203451]  ? try_to_compact_pages+0xb2/0x2b0
<4> [1060.203453]  try_to_compact_pages+0xb2/0x2b0
<4> [1060.203457]  __alloc_pages_direct_compact+0x62/0x150
<4> [1060.203461]  __alloc_pages_nodemask+0x71a/0x1120
<4> [1060.203467]  ? khugepaged+0x23b/0x25f0
<4> [1060.203471]  khugepaged+0x2dc/0x25f0
<4> [1060.203479]  ? wait_woken+0xa0/0xa0
<4> [1060.203483]  ? collapse_shmem.isra.8+0xeb0/0xeb0
<4> [1060.203486]  kthread+0x119/0x130
<4> [1060.203489]  ? kthread_park+0x80/0x80
<4> [1060.203493]  ret_from_fork+0x24/0x50
<4> [1060.203498] Modules linked in: vgem snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic mei_hdcp x86_pkg_temp_thermal coretemp cdc_ether usbnet mii snd_hda_intel snd_hda_codec crct10dif_pclmul crc32_pclmul snd_hwdep snd_hda_core snd_pcm ghash_clmulni_intel e1000e ptp pps_core mei_me mei i915 prime_numbers
<0> [1060.203512] Dumping ftrace buffer:
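
Two things can be read directly from the register dump above. The faulting address in CR2 is R13 + 0x30, which matches the instruction bytes at the fault marker (`<41> 8b 45 30`, a 32-bit load from `0x30(%r13)`). The sketch below checks that arithmetic and then converts R13 into a page frame number. The conversion rests on two assumptions that the log does not state: that the kernel uses the default x86-64 4-level vmemmap base `0xffffea0000000000` (no memory-layout randomization) and that `struct page` is 64 bytes. The helper names are made up for illustration.

```python
# Values copied from the oops in the description.
CR2 = 0xffffea0003ff8030
R13 = 0xffffea0003ff8000
RDX = 0x80000000000ffe00

# Assumptions, not taken from the log: default vmemmap base for x86-64
# with 4-level paging, and a 64-byte struct page.
VMEMMAP_BASE = 0xffffea0000000000
STRUCT_PAGE_SIZE = 64


def fault_offset(cr2: int, base_reg: int) -> int:
    """Offset of the faulting access relative to the base register."""
    return cr2 - base_reg


def page_to_pfn(page_addr: int) -> int:
    """Map a struct page address to its page frame number (assumed layout)."""
    return (page_addr - VMEMMAP_BASE) // STRUCT_PAGE_SIZE


offset = fault_offset(CR2, R13)
pfn = page_to_pfn(R13)
print(hex(offset), hex(pfn))
```

Under those assumptions the result is pfn 0xffe00, which equals the low 32 bits of RDX and R15 in the dump. Together with `PMD 0` in the page-table line, this would be consistent with compaction_alloc reading the struct page of a pfn whose memmap is not mapped, but the log alone does not prove that.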

Comment 1 CI Bug Log 2019-03-28 15:05:27 UTC

The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ICL: igt@kms_flip@flip-vs-fences-interruptible - dmesg-warn- BUG: unable to handle kernel paging request
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5824/shard-iclb2/igt@kms_flip@flip-vs-fences-interruptible.html

Comment 2 CI Bug Log 2019-03-28 15:16:16 UTC

The CI Bug Log issue associated to this bug has been updated.

### New filters associated

* ICL: igt@runner@aborted - fail - Previous test: kms_flip (flip-vs-fences-interruptible)
  - https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5824/shard-iclb2/igt@runner@aborted.html

Comment 3 CI Bug Log 2019-03-29 09:11:06 UTC

A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@kms_flip@flip-vs-fences-interruptible - dmesg-warn- BUG: unable to handle kernel paging request -}
{+ ICL: igt@kms_flip@flip-vs-fences-interruptible - dmesg-warn- (BUG: unable to handle kernel paging request|general protection fault: 0000) +}

New failures caught by the filter:

* https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5831/shard-iclb6/igt@kms_flip@flip-vs-fences-interruptible.html
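
The updated filter is an alternation of two dmesg patterns. A minimal sketch of how such a pattern behaves, assuming the filter text is treated as a regular expression and searched for in each dmesg line (how CI Bug Log actually evaluates filters is not shown in this bug). The function name and the general protection fault sample line are illustrative, not taken from any log here.

```python
import re

# Pattern text copied from the updated filter in comment 3.
DMESG_FILTER = re.compile(
    r"(BUG: unable to handle kernel paging request|general protection fault: 0000)"
)


def matches_filter(dmesg_line: str) -> bool:
    """True if the line contains either of the two failure signatures."""
    return DMESG_FILTER.search(dmesg_line) is not None


samples = [
    # From the description of this bug.
    "<1> [1060.203377] BUG: unable to handle kernel paging request at ffffea0003ff8030",
    # Illustrative line, not taken from any log in this bug.
    "<4> [  12.000000] general protection fault: 0000 [#1] PREEMPT SMP NOPTI",
    # From the description; carries neither signature.
    "<4> [1060.203422] Call Trace:",
]
results = [matches_filter(line) for line in samples]
print(results)
```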

Comment 4 CI Bug Log 2019-04-01 13:20:11 UTC

A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@kms_flip@flip-vs-fences-interruptible - dmesg-warn- (BUG: unable to handle kernel paging request|general protection fault: 0000) -}
{+ ICL: igt@kms_flip@flip-vs-fences-interruptible /  igt@gem_create@create-clear - dmesg-warn- (BUG: unable to handle kernel paging request|general protection fault: 0000) +}

 No new failures caught with the new filter

Comment 5 CI Bug Log 2019-04-01 13:21:23 UTC

A CI Bug Log filter associated to this bug has been updated:

{- ICL: igt@runner@aborted - fail - Previous test: kms_flip (flip-vs-fences-interruptible) -}
{+ ICL: igt@runner@aborted - fail - Previous test: (kms_flip|gem_create) +}

 No new failures caught with the new filter

Comment 6 Martin Peres 2019-04-01 13:27:27 UTC

(In reply to CI Bug Log from comment #4)
> A CI Bug Log filter associated to this bug has been updated:
> 
> {- ICL: igt@kms_flip@flip-vs-fences-interruptible - dmesg-warn- (BUG: unable
> to handle kernel paging request|general protection fault: 0000) -}
> {+ ICL: igt@kms_flip@flip-vs-fences-interruptible / 
> igt@gem_create@create-clear - dmesg-warn- (BUG: unable to handle kernel
> paging request|general protection fault: 0000) +}
> 
>  No new failures caught with the new filter

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5816/shard-iclb8/igt@gem_create@create-clear.html

https://intel-gfx-ci.01.org/tree/drm-tip/CI_DRM_5808/shard-iclb6/igt@gem_create@create-clear.html

Since the test igt@kms_flip@flip-vs-fences-interruptible uses execbuf and GTT fences, and we also fail at igt@gem_create@create-clear, there is a chance that this is not a generic Linux issue but a GEM one.

Given that this issue happened 4 times in a week on 4 different machines, and that the outcome is an oops, which breaks users' machines until they reboot, it is fair to raise the priority to highest.

Comment 7 Francesco Balestrieri 2019-04-02 12:19:43 UTC

The next step would be to see whether this reproduces consistently and then bisect to the culprit commit (wherever that may be). gem_create@create-clear might be the easier way to trigger this.
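
Bisecting a sporadic failure needs enough repetitions per commit before a commit can be called good. A rough sketch of that arithmetic, assuming independent runs and a per-run failure probability of 2 in 13 (the rate reported later in comment 12); both the independence assumption and the function name are illustrative, not something the CI tooling is known to use.

```python
import math


def runs_to_call_good(p_fail: float, confidence: float = 0.95) -> int:
    """Clean runs needed so a buggy commit would have failed at least once
    with the given confidence, assuming independent runs."""
    if not 0.0 < p_fail < 1.0:
        raise ValueError("p_fail must be strictly between 0 and 1")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    miss_probability = 1.0 - confidence
    return math.ceil(math.log(miss_probability) / math.log(1.0 - p_fail))


p_fail = 2 / 13  # failure rate from comment 12, used here as an estimate
needed = runs_to_call_good(p_fail)
print(needed)
```

With these inputs each bisection step would need 18 clean runs before marking a commit good at 95% confidence, which is one reason a more easily triggered test is attractive for bisection.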

Comment 8 Chris Wilson 2019-04-03 09:02:40 UTC

We also see page corruption in gem_create/create-clear on shard-snb. Tomi did some digging and found that it first occurred circa CI_DRM_5800. That is about a hundred runs after gem_create/create-clear was introduced, so it is reasonable to conclude that the sporadic failures were introduced by a later kernel update. The implication is that this is an -rc1 failure.

Comment 9 Chris Wilson 2019-04-05 20:03:20 UTC

*** Bug 110340 has been marked as a duplicate of this bug. ***

Comment 10 Francesco Balestrieri 2019-04-08 10:49:03 UTC

Andi, can you see if you can pinpoint the kernel commit that introduced this?

Comment 11 Jani Saarinen 2019-04-11 06:51:52 UTC

The BIOS was updated on the shards on the 10th of April.

Comment 12 Martin Peres 2019-04-11 06:58:58 UTC

Lowering the priority because the issue got seen twice in 13 runs, but then nothing for 105 runs. We'll close it next week, when the issue pops up at the top of the open bugs view of cibuglog.

Andi, it would be great if you could try to reproduce this on CI_DRM_5824 and, if you succeed, then try to reproduce it on drm-tip. This would give us confidence that this indeed was a SW issue :)

Comment 13 Martin Peres 2019-04-25 07:26:46 UTC

(In reply to Martin Peres from comment #12)
> Lowering the priority because the issue got seen twice in 13 runs, but then
> nothing for 105 runs. We'll close it next week, when the issue pops up at
> the top of the open bugs view of cibuglog.

It popped up again at the top, so now it is time to close it since it did not happen again! The last failure happened on CI_DRM_5831, and it has not been seen for 164 runs, which is above the 10x rule.

Closing!
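
The 10x rule is not defined in this bug. A minimal sketch of the check, assuming it means the issue must stay unseen for more than ten times the mean number of runs between failures observed while it was active (2 failures in 13 runs, per comment 12). The function names and this reading of the rule are assumptions.

```python
from fractions import Fraction


def quiet_runs_required(failures: int, runs: int, factor: int = 10) -> Fraction:
    """Quiet-run threshold: factor times the mean interval between failures."""
    if failures <= 0 or runs <= 0:
        raise ValueError("failures and runs must be positive")
    mean_interval = Fraction(runs, failures)
    return factor * mean_interval


def can_close(quiet_runs: int, failures: int, runs: int, factor: int = 10) -> bool:
    """True once the quiet period exceeds the assumed 10x threshold."""
    return quiet_runs > quiet_runs_required(failures, runs, factor)


threshold = quiet_runs_required(failures=2, runs=13)
closable = can_close(quiet_runs=164, failures=2, runs=13)
print(threshold, closable)
```

Under this reading the threshold is 65 runs, so 164 quiet runs clears it comfortably.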

Comment 14 Oleksandr Natalenko 2019-05-25 08:40:14 UTC

If it happens again, go and check whether [1] fixes it.

[1] https://lore.kernel.org/lkml/1558711908-15688-1-git-send-email-suzuki.poulose@arm.com/
